Towards a Data Infrastructure for Socio-economic Research on Europe: Improving Access to Official Microdata at the National and the European Level

Franz Kraus

This contribution focuses on problems of access to official microdata, their adverse impact on the building of a European data infrastructure for socio-economic research, and the need for improved co-operation between the academic community and the statistical offices. It sketches the access and infrastructure situation in Western as well as in Central Eastern Europe. Finally, it deals at length with the recently adopted Statistical Law of the EU and its implications concerning access to Community microdata in general and to the data of the European Community Household Panel in particular.

Europe has entered an era of profound socio-economic change, caused by a number of global developments and region-specific problems, such as the ageing of the European population. Economic and social policies are facing a challenge: a wide range of institutional innovations and adaptations are required that will have an immediate impact on the life chances and living conditions of major segments of society. To monitor and understand the complex dynamics of change, we need, among other things, encompassing information that is comparable across time and space. Information on inputs, processes and outputs has to be supplemented by information on values, attitudes and perceptions. It is also evident that sustained collection of longitudinal information is of particular relevance. Nobody doubts that access to these data is necessary at the level of individual units (persons, households, establishments) to understand the complex dynamics of change and to guide policies in an appropriate way. Over the last decades, both academia and government information services (statistical offices, etc.) have intensified their efforts to keep pace with the need for adequate international data (Tannenbaum/Mochmann 1994; Eurostat 1994c, 1995b, 1995c, 1996; Flora/Kraus/Noll/Rothenbacher 1994). Yet, academia and government information services took different paths and progressed at different speeds.1

1. Access to microdata at the national level for research on Europe

Within the social sciences, academia has for a very long time been characterised by its national orientation and disciplinary fragmentation. Over the years, various types of service institutions (data archives, documentation services, survey research centres) were established in many countries of Europe to facilitate access to information and sharing of data within and across borders (Tannenbaum/Mochmann 1994). Gradually, comparative databases were established as well: initially by making third-party data available (the Eurobarometer of the European Commission since the mid-70s), then increasingly through academia's own collection programmes (the European and World Values Surveys in the early 80s and the early 90s, the International Social Survey Programme and the Socio-economic Household Panels since the late 80s). With the European Social Survey, a new, continuous, international survey programme has been launched recently (European Science Foundation 1998), and the first experimental survey might be carried out in 1999. With the exception of the Household Panels, all survey programmes are supplements to official statistics, focusing on attitudes, values and perceptions. Access to anonymised microdata has never been a problem in academia, and with the possibility of networking, high priority has been given to facilitating access to available resources. With the foundation of the European Consortium for Political Research, the establishment of a European Science Foundation and, more recently, the European Consortium for Sociological Research, major organisational efforts have been made to promote comparative research on a wider scale.

Yet, despite all these and other achievements not mentioned here, official statistics remains the major source for comparative research on Europe. Despite all administrative fragmentations and functional differentiations of statistics (Als 1993a, 1993b; Garonna/Sofia 1997), the post-war period shows rapidly increasing co-operation between national statistical offices and international standardizers of statistics, such as the United Nations' Statistical Commission, the Conference of European Statisticians, OECD and, increasingly, the statistical office of the European Union (Eurostat). As a result, the last two decades have witnessed an unprecedented convergence of enumeration programmes, concepts and methods - although considerable differences still persist (Flora/Kraus/Noll/Rothenbacher 1994). With the expansion of membership and increased competences of the European Commission, Eurostat is more and more becoming a key actor also with respect to the collection of data. Earnings and labour cost surveys, the labour force survey, and, recently, the European Household Panel are important examples. Statistical programmes have to be adapted to the new situation, and, at least in the field of social statistics, programmes will have to undergo major revisions (Eurostat/ISTAT 1997). Comprehensive social surveys, for example, will definitely be needed for socio-economic reporting in the near future (Tuinen 1995; Vogel 1994), even though the Member States have so far not given Eurostat (arguing its competence and obligation in this field) the go-ahead. Yet, because of the still limited subject coverage and the incomplete membership space of the European Union, national statistics remains an indispensable source for research on Europe.

The scope and quality of comparative research and of research on European integration depend to a large degree on access to data at the level of individual statistical units. In contrast to academia, with its established tradition of data sharing and the early reconciliation of the inherent conflict between data protection and data access, government agencies usually have greater difficulties in allowing third parties to access the collected microdata. During the 80s, the possible misuse of microdata became a matter of heated public debate in most West European countries, and access to microdata became more and more restricted. To a certain extent, this is understandable. Statistical offices depend heavily on sustained co-operation of their statistical universe. The confidence of interviewees is a precious good, and data protection is a crucial tool to ensure this confidence. Yet, in principle, the same holds true for all data collecting organisations, and there are several examples in official statistics, both outside and inside Europe, which demonstrate that broad access for scientific purposes can be balanced quite well with data protection needs.

In the meantime, the situation has somewhat improved. The idea of 'de-facto anonymity' of data - i.e. that anonymised data are to be considered safe if disclosure would require an unreasonably high input of resources - has been passed by the European Council and ratified by its Member States. Efficient procedures for anonymisation have been developed and extensively discussed in several international conferences (cf. Eurostat 1994a, 1995a and a recent conference in Luxembourg). As a result, de-facto anonymity as a basic criterion for data protection is now more or less universally accepted. This is certainly one of the major reasons why access is gradually becoming less restrictive in most countries, recently also in Germany (ZUMA 19972,3).

Yet, considering Europe today, access is still too restrictive and too diverse to make official statistics what it has the potential to be: a major tool for research on Europe. Conditions vary from country to country with respect to the extent, form and organisation of access. Table 1 gives a rough overview of the current situation. It also includes information on the role of national social science services (data archives or special microdata services) as potential mediators between statistical offices and the scientific community (Eurostat 1993, 1994b and updates4).

Extent of access

Access to official microdata for comparative purposes, though subject to certain constraints regarding privacy, has been made mandatory in two European countries: Italy and the United Kingdom. Only in the United Kingdom, though, is access almost universal in practice. For comparative work, however, the most striking feature is that virtually no source is accessible across all countries, and that access to key sources, such as population censuses and establishment surveys, is most restricted. The situation is further complicated by differences in anonymisation procedures (removal of information, aggregation of details, etc.), which additionally impair the suitability and comparability of national sources for comparative research.

Form of access

Concerning the form of access, there is an extreme variety. Other differences relate to available variables. Some countries provide pre-fabricated files with a certain subset of variables only (public use or scientific use files), others give full access to all variables, and some allow or require a selection by the user. In several countries, access conditions are so restrictive that cumulation of knowledge and experience becomes rather difficult. Frequently, the use of data is limited in time, and the data must be destroyed afterwards. In one country, access to microdata is possible only within the secure area of the statistical office, as is the case, for example, with access to the ECHP data at Eurostat. What usually remains in such cases is the programme setup to input and label the raw data. A special situation can be observed in Scandinavia and partly also in the Netherlands, where register-based data are increasingly used to reduce the burden that surveys place on interviewees. However, in some of these countries, such as Finland, researchers can, in principle, receive synthetic microdata where survey records have been linked with register data.

Organisation of access: the current role of national science organisations

Concerning the organisation of access, practice differs not only with regard to procedural detail and complexity, which sometimes leads to considerable delays in research schedules. It differs even more with respect to the leeway that regulations leave for the accumulation of knowledge and experience. Negative consequences for the efficacy and effectiveness of research can be quite considerable in the case of large-scale surveys and panel studies, where one usually has to invest a considerable amount of time to make data ready-to-use. Here, the major obstacle derives from restrictions on the duration of use and on the controlled dissemination of ready-to-use data to other researchers. These certainly are essential precautions against potential misuse of data. Their adverse impact on the productivity of research, however, can hardly be overrated. Placing national science organisations between government agencies and individual users as an interface would make it possible to retain these data protection measures without incurring the negative impact on research. Knowledge and experience in data documentation and analysis could accumulate without having to sacrifice data protection measures at the level of the individual user.

Table 1: Access to official microdata: some basic characteristics of current regulations

In favour of access mediation through social science services

Comparative research on Europe needs a network of infrastructural support nodes to make efficient use of the vast amount and variety of survey data that is continuously collected by government agencies in Europe. In a period of constrained budgetary resources, discourse and co-operation between the academic community and the statistical offices must be intensified to avoid duplication of efforts and to improve access. Comprehensive access to microdata is a major prerequisite for high-quality research. Some of the quarrels that came up recently within the context of the European Community Household Panel, in fact, might have to do with access conditions. In case of conflict, academia tends to give priority to its own surveys, one of the main reasons being access to data at the level of individual units.

Facilitated access to more microdata could be arranged if scientific organisations with national responsibilities were allowed to act as mediators between academic users and statistical offices. Subject to a set of basic principles and rigorous codes of behaviour, these organisations could become a forum for the accumulation of knowledge and experience with large-scale microdata - without sacrificing data protection measures such as limiting time of use and prohibiting uncontrolled dissemination to others.

In many European countries, social science data archives with national obligations have been established during the last three decades. In some countries, local archives serve the sciences on a voluntary basis (Spain: Centro de Investigaciones Sociológicas5; Italy: Archivio Dati e Programmi per le Scienze Sociali6), and in some countries, national archives are about to be established (Finland, Republic of Ireland). Despite all the differences in size and function, this network of archives and resource centres, organised in the Council of European Social Science Data Archives (CESSDA7), might provide the nucleus for such a European data infrastructure.

The map below shows the current situation with respect to the availability of country-wide services and their possible use as an interface towards an enhanced use of official microdata.

Considering the services available in Austria (WISDOM8), France (LASMAS, cf. CNRS/Lasmas 1993), the Netherlands (NWO/WSA9), Norway (NSD10) and the United Kingdom (ESRC Data Archive11) and their impact upon high-quality research, the payoff of a close co-operation between statistical offices and social science-based service institutions becomes obvious.

In Great Britain, for example, the ESRC Data Archive and the Office for National Statistics (ONS) have finally arrived at a very flexible agreement facilitating scientific access to virtually all government surveys to a hitherto unprecedented extent (Sylvester 1996). ONS provides data free of charge to the Data Archive, which acts on behalf of ONS, serving the academic community not only with the microdata but with a wide variety of additional services as well. Extensive and high-quality documentation, ready-to-use data, training in the analysis of large-scale government surveys, source-specific user groups, and regular introductions to major survey programmes (together with the Royal Statistical Society) promote the use of official microdata and the continuous exchange of information between individuals and statistical sectors. Other resource centres of ESRC provide additional services, ranging from meta-information on surveys (CASS12) to user-friendly online access to strategic large-scale surveys (MIDAS13), including small-area microdata of the population census and, finally, to integrated macrodata on Europe (R.CADE14). Publications, both in number and substance, clearly demonstrate how universal scientific access to anonymous official microdata pays off for all parties: the statistical office, the academic community, the government and the public at large. The success story of the transnational microdata and research centre CEPS15 (Luxembourg) provides further evidence of the considerable payoff of improved co-operation between research services and statistical offices. Through the provision of (indirect) access to harmonized microdata of national family budget surveys and labour force surveys, and direct access to academic household panel surveys, high-quality comparative research on socio-economic key issues has developed, which would not have been possible without such access.

It is obvious that there is still much room for improving the situation. Proper storage, extensive documentation, and easy access to meta-information and to data of a wide variety of surveys are essential prerequisites of a European data infrastructure for comparative research. Social science data archives are experienced in documenting data and providing controlled, but user-friendly access to them. In many countries (Denmark, Germany, Italy, Sweden, Switzerland, and soon also Finland and eventually Ireland), proper infrastructural services are available but not used for supporting research with official microdata. In those countries where no infrastructural services are available at all, research-oriented data centres might be developed to take over similar functions. With some imagination and political will, co-operation models could be developed to the benefit of everybody.

In many countries, such a solution would require changes in statistical legislation, as for example in Germany, where brokerage by third-party organisations is prohibited by law. The advantages of such a solution are, however, obvious and could be shared both by the academic community and the government statistics sector. The adding of value (through increased use of data, both at the level of data quality, documentation and analysis), the adding of legitimacy (through evidencing the need of data not only for the purpose of efficient governing, but also for scientific analyses), the adding of knowledge diffusion between the two sectors (through increasing needs of co-operation), the promotion of democratic discourse (through the counterbalance of politically independent science), and, last but not least, a more economic use of resources are obvious examples.

2. Access to European community surveys: the new statistical law of the European Union

The statistical office of the European Communities, Eurostat, has over the years gained statistical competences in many fields that are relevant for socio-economic research. Considerable efforts have been invested in harmonising relevant microdata of the Member States. These microdata-bases could become a major tool for socio-economic research on Europe - provided that the academic community could gain access to them.

Figure 1

Previous legislation concerning access to Community microdata

However, access to European Community microdata (i.e., microdata transmitted to Eurostat for the purpose of Community statistics) was virtually impossible until recently, due to the 'Council Regulation No. 1588/90 of 11 June 1990 on the transmission of data subject to statistical confidentiality to the Statistical Office of the European Communities' (Celex Document No. 390R1588). In this law, 'confidential data' was defined as 'data declared confidential by the Member States in line with national legislation or practices governing statistical confidentiality' (ibid., article 1,1). Access to data declared confidential was restricted to officials of Eurostat: 'Confidential statistical data transmitted to the SOEC [i.e., Eurostat, the author] shall be accessible only to officials of the SOEC and may be used by them exclusively for statistical purposes' (article 4,2). Persons outside of Eurostat could be granted access only if they were 'working on the premises of the SOEC under contract, in special cases and exclusively for statistical purposes' (article 5,3). Access to confidential Community microdata outside the Commission was basically limited to counsellors of Eurostat and to researchers carrying out research under the premise of Eurostat (i.e. Eurostat counsellors and some TSER projects). The regulation was limited to the treatment of confidential data, giving, in principle, some leeway for granting access to non-confidential microdata. When starting the European Community Household Panel project (ECHP), Eurostat from the very beginning opted for granting access to microdata for scientific purposes. And in fact, several of the Member States participating in the project did allow dissemination of their data for scientific use.

The new regulation

In the meantime, the legal framework regulating access to Community microdata has been modified substantially, opening up new prospects for general access to Community microdata. With the 'Council Regulation No. 322/97 of 17 February 1997 on Community Statistics' (Celex document 397R0322), the so-called Statistics Law, the definition of statistical confidentiality was replaced and access to Community microdata for scientific purposes introduced for the first time. The regulation no longer leaves it to the Member States to determine whether transmitted data are to be considered confidential: instead, it obliges all Member States to establish a uniform set of minimum standards for protecting individual level data against unlawful disclosure (preamble, items 3 and 4; chapter III, article 10; chapter V, article 13). Then, article 13 defines statistical confidentiality in relative terms (i.e., in implicit association with the notion of 'de-facto anonymity', as laid down by the European Council in its recommendation on scientific research and statistics):

'1. Data used by national authorities and the Community authority for the production of Community statistics shall be considered confidential when they allow statistical units to be identified, either directly or indirectly, thereby disclosing individual information.

To determine whether a statistical unit is identifiable, account shall be taken of all the means that might reasonably be used by a third party to identify the said statistical unit' (Council Regulation No. 322/97, chapter V).

As all Member States are obliged to introduce appropriate measures against unlawful disclosure of microdata, Community microdata, prima facie, are to be considered as non-confidential. In principle, access can be denied only in those cases where the transmission of confidential data (in the sense introduced by the Statistics Law) is necessary for the production of specific Community statistics (article 14). In those cases, it is up to the national authorities whether to provide access to these data for scientific purposes.

Concerning the confidentiality of sources required for Community Statistics, European law is now going to replace national law or practice. In case of conflict, however, the new 'Statistics Law' can be applied only to those enumeration programmes that are implemented on the basis of EU law. Data transmitted on a voluntary basis (such as the microdata of the family budget survey) are, strictly speaking, not covered. Furthermore, the new regulation, which has the character of framework law, now has to be translated into source-specific legislation. Yet, Eurostat's statistical competences have been growing in the past, and they will continue to do so in the future. In the recent past, the competences of the Commission have been considerably expanded to cover the social field (Treaties of Maastricht and Amsterdam, as well as corresponding legislation on the Community Statistical Programme, and, last but not least, the Commission Decision of 21 April 1997 on the role of Eurostat as regards the production of Community statistics, cf. 97/281/EC, Celex document 397D0281). As a consequence, the competences of Eurostat will gradually expand as well: in order to fulfil its obligations, the EU's statistical office will increasingly need data in the form of microdata, and existing sources will have to be supplemented by new ones.

In the short term, however, the 'Statistics Law' has quite adverse implications for those (few) outside users that already had access to Community microdata under the previous legislation. Because the Regulation did not fix the standards for anonymization, the criteria have to be operationalized before access to microdata can be granted. Obviously, concrete operationalizations are partially source-dependent. It is therefore very likely that consensus with the data providers must be reached for each single type of survey.

Prospects for accessing ECHP data via Eurostat16

Due to this situation, access to ECHP data is currently limited to the data providers, Eurostat, and their counsellors. Research projects under the premise of Eurostat currently have no access to the microdata - irrespective of their contractual obligations. It is evident that nobody can be satisfied with this situation. In autumn 1997, Eurostat, which has been interested in intensive and extensive scientific use of the data from the very beginning, suggested a set of minimum standards to the ECHP project members to ensure de-facto anonymity of the ECHP microdata. Consensus building is still going on, and the resulting standards will finally also have to be adopted by the 'Committee on Statistical Confidentiality'. For now, it is hoped that the original timescale can be maintained and that access to the ECHP data in the form of an integrated longitudinal database can be granted by fall 1998 (Eurostat/ECHP 1998, 1; 1996). Due to legal problems, however, German ECHP data will very likely not be accessible outside of Germany.

Whether this news on the forthcoming accessibility of ECHP microdata will, in the end, turn out to be good news depends, however, on the restrictiveness of the anonymisation procedures (aggregation of codes, removal of information) finally adopted. There is some concern among scientists that procedures might go beyond what is acceptable from a scientific point of view, depriving this first multidimensional EU-wide survey of the analytical potential for which it was so warmly welcomed17. Many scientists and scientific institutes are further concerned about the low transparency of the procedure and the apparent lack of institutionalized channels for adequate interest articulation during the process of consensus-building.

3. On the way towards an infrastructure for socio-economic research on Europe?

For a variety of reasons (available resources, degree of institutionalized international co-operation, programme continuity, etc.) official microdata must be the cornerstone of a European data infrastructure for comparative research. Academic microdata are an essential complement, but they cannot substitute for official surveys. Both sectors of statistics move in a direction that strengthens research on Europe and on European integration.

The academic sector contributes to it through the initiation and continuous expansion of repeated international survey programmes and panel studies, through the efficient use of new technologies to provide easy access to their information (creation of a virtual European Data Archive and project on Networked Social Science Tools and Resources18), through developing data archives further into European resource centres facilitating and supporting comparative research and, last but not least, through increased international networking of transdisciplinary socio-economic research. National research funds devote increasing parts of their resources to promoting the comparative orientation of infrastructural services and research activities. Additional support comes from the European Community, particularly through the provision of funds to enhance the use of already existing large-scale facilities in the socio-economic sciences19 (TMR programme), and, though to a more limited degree, the building-up of new research infrastructures within TSER's programme for horizontal activities20. Though in slightly different form, support will also be continued in the 5th framework programme (cf. CORDIS, COM 98(305)) and apparently extended to foster the international pooling of small network resources as well (Ziegler 1998).

The government information sector contributes to it through increasing standardization of national enumeration programmes, concepts and methods, through the gradual move to longitudinal enumerations, data integration and innovative social statistics, and through the consistent use of modern networking technologies to provide information in a user-friendly way within and across national borders. The European Union contributes to these efforts through funding of a variety of Information Technology and Informatics research projects under the umbrella of DOSIS (Eurostat 1997). Easy access to comprehensive meta-information, both of national statistical institutes and of Eurostat, promoted by the European Union, seems to be not far away (Eurostat 1997, 1993), and will certainly promote comparative research.

The major problem, however, will remain: how to provide access to the extremely rich information collected by statistical agencies at the level of individual units (persons, households, firms) without violating the individual's right to privacy. In the medium term at least, this problem will hardly be solvable by technical means (cf. for example, Dosis-Project ADDSIA, Eurostat 1997). What is necessary is a new consensus that brings two basic rights into a more balanced relation: the right to privacy and the right to information. It is time to think about current regulations of access to microdata not only in terms of potential misuse - but also in terms of the cost of non-access.

