Presented at W3C Workshop on Improving Access to Financial Data on the Web, 5-6 October 2009.
Consumers of financial information come in many guises, from personal investors looking for a value-for-money share, to government regulators investigating corporate fraud, to business executives seeking an advantage over their competitors. While the particular analysis performed by each of these information consumers will vary, they all have to deal with the explosion of information available from multiple sources, including SEC filings, corporate press releases, market press coverage, and expert commentary. Recent economic events have brought a sharp focus on the activities and actions of financial markets, institutions, and not least regulatory authorities. Calls for enhanced scrutiny will bring increased regulation and information transparency.
While extracting information from individual filings is relatively easy to perform when a machine readable format is utilized (for example, using XBRL, the eXtensible Business Reporting Language), cross comparison of extracted financial information can be problematic as descriptions and accounting terms vary across companies and jurisdictions. Across multiple sources the problem becomes the classical data integration problem where a common data abstraction is necessary before functional data use can begin.
Within this paper we discuss the challenges in converging financial data from multiple sources. We concentrate on the abstraction, linking, and consolidation activities needed to integrate data before more sophisticated analysis algorithms can examine it for the objectives of particular information consumers (e.g. competitive analysis, regulatory compliance, or investor analysis). We base our discussion on several years of researching and deploying data integration systems in both web and enterprise environments.
A prominent provider of public domain financial information is the Securities and Exchange Commission (SEC), which through its EDGAR web site makes freely available a wide range of personal and company filings. Data from the SEC ranges from information about executives reporting the sale of equity in their companies (Form 4) to detailed annual reports (Form 10-K). Filings are in the older SGML format, free text (HTML and PDF), or, more recently, XBRL. A wide range of other governmental or intergovernmental organisations publish data in various formats. For example, central banks use RSS-CB for publishing currency exchange rate data. The United Nations, World Bank, Eurostat, and the OECD are working towards a standard format for publishing statistical information, SDMX (Statistical Data and Metadata Exchange). Finally, there is considerable (financial) information regarding companies and their executives available in Wikipedia, which DBpedia publishes in the web standard RDF (Resource Description Framework).
A large number of information consumers have varying degrees of interest in financial data. The integration and augmentation of financial information is of significant benefit for financial and business analysis, as the following three use cases illustrate. The principle behind each is that XBRL information extracted from SEC filings receives a semantic metadata lift, allowing the data to then be published in RDF format. The Rhizomik project, OpenLink's sponger, and Dave Raggett's recent efforts are examples of first offerings for mapping XBRL to RDF. Once in RDF, the data can be linked with and augmented by financial data extracted from sources such as those previously mentioned, forming a consolidated financial 'mash-up'.
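The metadata lift from XBRL to RDF can be illustrated with a minimal Python sketch. It is not the mapping used by any of the projects named above: the XBRL fragment is heavily simplified, and the `ex:` taxonomy namespace, the placeholder entity identifier, the `BASE` URI, and the function names are all assumptions made for illustration. Each fact becomes one triple whose subject identifies the reporting entity and context, whose predicate is the taxonomy concept, and whose object is the reported value.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified XBRL-like instance. Real filings carry far
# more structure (units, periods, dimensions, taxonomy references).
XBRL_SAMPLE = """
<xbrl xmlns="http://www.xbrl.org/2003/instance"
      xmlns:ex="http://example.org/taxonomy#">
  <context id="FY2008">
    <entity>
      <identifier scheme="http://www.sec.gov/CIK">0000000000</identifier>
    </entity>
  </context>
  <ex:Revenues contextRef="FY2008" decimals="0">1000</ex:Revenues>
  <ex:NetIncomeLoss contextRef="FY2008" decimals="0">150</ex:NetIncomeLoss>
</xbrl>
"""

XBRLI = "http://www.xbrl.org/2003/instance"
BASE = "http://example.org/filing/"  # assumed URI base for minted subjects


def lift_to_triples(xml_text):
    """Return one (subject, predicate, object) tuple per reported fact."""
    root = ET.fromstring(xml_text)
    # Map each context id to the entity identifier it refers to.
    entities = {}
    for ctx in root.findall(f"{{{XBRLI}}}context"):
        ident = ctx.find(f"{{{XBRLI}}}entity/{{{XBRLI}}}identifier")
        entities[ctx.get("id")] = ident.text.strip()
    triples = []
    for elem in root:
        ctx_ref = elem.get("contextRef")
        if ctx_ref is None:
            continue  # not a fact (for example, a context element)
        # ElementTree tags look like "{namespace}LocalName".
        namespace, _, local = elem.tag[1:].partition("}")
        subject = f"{BASE}{entities[ctx_ref]}/{ctx_ref}"
        triples.append((subject, namespace + local, elem.text.strip()))
    return triples


def to_ntriples(triples):
    """Serialise the triples as N-Triples lines with plain string literals."""
    return "\n".join(f'<{s}> <{p}> "{o}" .' for s, p, o in triples)


if __name__ == "__main__":
    print(to_ntriples(lift_to_triples(XBRL_SAMPLE)))
```

A production mapping would also need typed literals, units, and reporting periods; the sketch only shows the step from document-centric facts to graph statements that can then be linked to other sources.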
Financial data in electronic form comes in various formats on a continuum of structure, ranging from unstructured text, to highly structured XML data, to graph-structured data in RDF. The goal is to allow users to analyse the underlying datasets and derive actionable knowledge from the aggregated and integrated data.
Our data integration approach comprises several stages:
The different data processing steps from the raw data via graph-structured and integrated data to the end user are illustrated in the following figure.
In the following section we give a characterisation of the type of data we identified for addressing the outlined use case scenarios. We focus here on publicly available datasets; however, our approach is equally applicable to specialized in-house sources and formats.
The main obstacle preventing easy integration into a holistic dataset is that these types of data are in different formats, separate, and not interlinked. For a classification of data according to the types of information encoded (e.g. time series, taxonomic data) we refer the interested reader to Ben Shneiderman's paper.
A typical system architecture for a data integration system is depicted in the following figure. The system consists of two components: a data preparation and integration component that converts data from different sources into a common format, and a query and user interface module that operates over the integrated dataset. We will discuss each of the components later on.
Integrating data from multiple sources provides a common data platform from which search, browsing, analysis, and interactive visualisation can take place. Consolidation in semantic web terms leads to an aggregated source view or a coherent graph amalgamated, 'mashed up' from potentially thousands of sources, where an entity centric approach can provide a powerful single view point allowing information filtering and cross analysis. The key challenge for any information system operating in this space is the need to perform a semantic integration of structured and unstructured data from the open Web and monolithic data sources such as XML database dumps and large static datasets. This can be achieved using a hybrid data integration solution which amalgamates the data warehousing and on-demand approaches to integration.
From this integration emerges a large graph of RDF entities with inter-relations and structured descriptions of entities: archipelagos of information coalesce to form a coherent knowledge base. Entities are typed according to what they describe: people, locations, organizations, publications as well as documents; entities have specified relations to other entities: people can work for companies, people know other people, people author documents, organisations are based in locations, and so on.
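A graph of typed entities and relations like the one described here can be sketched in a few lines of Python. This is an illustration only: the `ex:` identifiers, the sample data, and the `match` and `query` helpers are hypothetical, and the matching is a naive stand-in for what an RDF store and a query language such as SPARQL would provide.

```python
# Hypothetical entity graph as (subject, predicate, object) triples.
GRAPH = [
    ("ex:alice", "rdf:type", "ex:Person"),
    ("ex:bob", "rdf:type", "ex:Person"),
    ("ex:acme", "rdf:type", "ex:Organisation"),
    ("ex:alice", "ex:worksFor", "ex:acme"),
    ("ex:bob", "ex:worksFor", "ex:acme"),
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:acme", "ex:basedIn", "ex:dublin"),
]


def match(graph, pattern, binding=None):
    """Yield variable bindings for one triple pattern.

    Terms starting with '?' are variables; all other terms must match exactly.
    """
    binding = binding or {}
    for triple in graph:
        candidate = dict(binding)
        for term, value in zip(pattern, triple):
            if term.startswith("?"):
                # Bind the variable, or check it against an earlier binding.
                if candidate.setdefault(term, value) != value:
                    break
            elif term != value:
                break
        else:
            yield candidate


def query(graph, patterns):
    """Join several triple patterns, in the spirit of a basic graph pattern."""
    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for b2 in match(graph, pattern, b)]
    return bindings


if __name__ == "__main__":
    # Who works for an organisation based in Dublin?
    for row in query(GRAPH, [("?p", "ex:worksFor", "?o"),
                             ("?o", "ex:basedIn", "ex:dublin")]):
        print(row["?p"], "works for", row["?o"])
```

Because every entity is addressed by an identifier rather than by its position in a document, a question that spans people, organisations, and locations reduces to joining patterns over the same set of triples.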
This step involves collecting and integrating data from a plethora of sources in a multitude of formats, such as native RDF, RSS streams, HTML, MS Office, PS, and PDF documents. This information can be located across multiple information systems such as databases, document repositories, government sites, company sites, and news sites. To collect it, web crawlers or database wrappers can be employed.
In order to avoid having the knowledge contribution of entities split over numerous instances, the system will need to connect sources that may describe the same data on a particular entity. Within one of our case studies, analysing the connections between people and organizations in SEC filings (Form 4) identified 69,154 people connected to 80,135 organizations. The same analysis performed on a database describing companies produced 122,013 people connected to 140,134 organizations. Once collected, the base dataset needed to be enriched and interlinked using entity consolidation (a.k.a. object consolidation).
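Entity consolidation can be sketched as follows. This is a minimal illustration, not the algorithm used in the case study: it assumes that certain properties (here the hypothetical keys `cik` and `homepage`) identify an entity uniquely, and it merges any records sharing a value for such a property using a union-find structure. The record identifiers and sample data are invented.

```python
from collections import defaultdict

# Hypothetical records from three sources. "cik" and "homepage" stand in for
# identifying properties whose values are assumed unique to one entity.
RECORDS = [
    {"id": "sec:p1", "name": "J. Smith", "cik": "111"},
    {"id": "db:p7", "name": "John Smith", "cik": "111",
     "homepage": "http://example.org/js"},
    {"id": "wiki:p3", "name": "John A. Smith",
     "homepage": "http://example.org/js"},
    {"id": "sec:p2", "name": "A. Jones", "cik": "222"},
]
IDENTIFYING_KEYS = ("cik", "homepage")


def consolidate(records, keys=IDENTIFYING_KEYS):
    """Group record ids that share a value for any identifying key."""
    parent = {r["id"]: r["id"] for r in records}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # (key, value) -> id of the first record carrying it
    for record in records:
        for key in keys:
            if key in record:
                marker = (key, record[key])
                if marker in first_seen:
                    union(first_seen[marker], record["id"])
                else:
                    first_seen[marker] = record["id"]

    groups = defaultdict(list)
    for record in records:
        groups[find(record["id"])].append(record["id"])
    return sorted(sorted(group) for group in groups.values())


if __name__ == "__main__":
    for group in consolidate(RECORDS):
        print(group)
```

In the sample data the first three records collapse into one entity even though the first and third share no property directly: the merge is transitive through the second record, which is what lets descriptions from many sources accumulate on a single entity.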
Semantic analysis within these systems will be closely tied to the purpose of the system, such as fraud detection, competitive analysis, or profit projection. While the exact algorithm used for analysis will vary, a number of common services will be needed to assist in the examination and querying of the data, including local index creation and management, distributed query processing, and runtime consolidation of results. Data organised around consolidated items, rather than the XML document-based approach of XBRL, not only provides insight but, underpinned by a linked data backbone (semantic web technology), also allows a means of data querying that conventional tools do not. SPARQL, the semantic web query language, allows queries such as the following to be asked:
Analysis of the consolidated data sources could also be performed by communities or groups of analysts who could be employed to annotate the data further, for example to flag irregularities. An example of this "crowd sourcing" approach to data analysis is the Guardian's site that asks readers to tag and report UK MPs' expense claims for further investigation.
There are a number of challenges to address when integrating data from different sources. We classify these challenges into four groups: text/data mismatch, object identifiers and schema mismatch, abstraction level mismatch, and data accuracy.
The single largest barrier to developing sophisticated semantic analysis methods and algorithms for use in financial analysis, fraud detection, or regulatory activities is the difficulty of integrating multiple financial data sources into a more holistic and transparent data set.
In this paper we have highlighted the data integration challenges facing the provision of transparent financial information and where semantic standards and approaches can be of direct benefit. We have presented an architectural approach based upon previous case study experiences, along with the remaining challenges in the area. We feel that leveraging financial source data in the manner described will help distil actionable knowledge that can be used for driving business decisions and policy.
The work presented in this paper has been funded in part by Science Foundation Ireland under Grant No. SFI/08/CE/I1380 (Lion-2).