Enabling Spatio-Temporal Search in Open Data

Intuitively, most datasets found on governmental Open Data portals are organized by spatio-temporal criteria, that is, single datasets provide data for a certain region, valid for a certain time period. Likewise, for many use cases (such as data journalism and fact checking) a predominant need is to scope down the relevant datasets to a particular period or region. Rich spatio-temporal annotations are therefore crucial to enable semantic search for (and across) Open Data portals along those dimensions, yet – to the best of our knowledge – no working solution exists. To this end, in the present paper we (i) present a scalable approach to construct a spatio-temporal knowledge graph that hierarchically structures geographical as well as temporal entities, (ii) annotate a large corpus of tabular datasets from open data portals with entities from this knowledge graph, and (iii) enable structured, spatio-temporal search and querying over Open Data catalogs, both via a search interface and via a SPARQL endpoint, available at data.wu.ac.at/odgraphsearch/


Introduction
Open Data has gained a lot of popularity and support by governments in terms of improving transparency and enabling new business models: Governments and public institutions, but also private companies, provide open access to raw data with the goal to present accountable records [1], for instance in terms of statistical data, but also in fulfillment of regulatory requirements such as, e.g., the EU's INSPIRE directive. 3 The idea to provide raw data, instead of only human-readable reports and documents, is mainly driven by providing direct, machine-processable access to the data, and  [2,3].
With the advent of knowledge graphs, traditional web search has recently been revolutionized in that search results can be categorized, browsed and ranked according to well-known concepts and relations, which cover typical search scenarios in search engines. But these scenarios are different for Open Data: in our experience, dataset search needs to be targeted from a different angle than keyword search (alone). Intuitively, most datasets found in Open Data (as it is mostly regional/national census-based) are organized by spatio-temporal scopes, that is, single datasets provide data for a certain region, and are valid for a certain time period; our goal is to cover exactly these two dimensions, which are prevalent in Open Data. Indeed, our approach successfully annotates geospatial information in 75% of the datasets, and temporal information in almost 58% of all datasets (cf. Section 4.3 for the detailed evaluation). Also, Kacprzak et al. [4] recently confirmed the relevance and need of spatio-temporal annotations and search across Open Data portals: they analyzed the query logs of four data portals (including data.gov.uk) with respect to different aspects and characteristics, and list temporal and geospatial queries as the top two query types.
We argue that, just like for regular Web search, knowledge graphs can significantly improve search. Specifically for our use case, we aim at constructing a spatio-temporal knowledge graph from publicly available sources: in fact, the ingredients for building such a knowledge graph of geographic entities as well as time periods and events already exist on the Web of Data, although they have not yet been integrated and applied in a principled manner to the use case of Open Data search.
Herein, we present a scalable approach to (i) construct a spatio-temporal knowledge graph that hierarchically structures geographical entities as well as temporal entities, (ii) annotate a large corpus of tabular Open Data, currently holding datasets from eleven European (governmental) data portals, and (iii) enable structured, spatio-temporal search over Open Data catalogs through this spatio-temporal knowledge graph, available at http://data.wu.ac.at/odgraphsearch/.
In more detail, we make the following concrete contributions:
• A detailed construction of a hierarchical knowledge graph of geo-entities and temporal entities and links between them.
• A scalable labelling algorithm for linking open datasets (both on a dataset-level and on a record-level) to this knowledge graph.
• Indexing and annotation of datasets and metadata from 11 Open Data portals from 10 European countries and an evaluation of the annotated datasets to illustrate the feasibility and effectiveness of the approach.
• A prototypical search interface, consisting of a web user interface allowing faceted and fulltext search, a RESTful JSON API that allows programmatic access to the search UI, as well as API access to retrieve the indexed datasets and their respective RDF representations.
• A SPARQL endpoint that exposes the annotated links and allows structured search queries.
• Code, data and a description on how to re-run our experiments, which we hope to be a viable basis for further research extending our results, are available for re-use (under GNU General Public License v3.0). 4
The remainder of this paper is structured as follows: In the following Section 2 we introduce (linked) datasets, repositories and endpoints to retrieve relevant temporal and spatial information. Section 3 provides a schematic description of the construction and integration of these sources into our base knowledge graph, a constructed knowledge graph which serves as a basis for annotation and linking of the datasets; its actual realization in terms of implementation details is fully explained in Appendix A. In Section 4 we present the algorithms to add spatio-temporal annotations to datasets from Open Data portals, and evaluate and discuss the performance (in terms of precision and recall based on a manually generated sample) and limitations of our approach. The vocabularies and schema of our RDF data export are explained in Section 5, and the back-end, the user interface and the SPARQL endpoint (including example queries) are presented in Section 6. We discuss related and complementary approaches in Section 7, and finally we conclude in Section 8.

Background
When people think of the spatial and temporal context of data, they usually think of concepts rather than numbers, that is, "countries" or "cities" instead of coordinates or a bounding polygon, or an "event" or "time period" instead of, e.g., start and end times. In terms of dataset search, that could mean someone searching for datasets containing information about demographics for the last government's term (in other words, between the last two general elections).
In order to enable such search by spatio-temporal concepts, our goal is to build a concise, but effective knowledge base, that collects the relevant concepts from openly available data into a coherent knowledge graph, for both (i) enabling spatio-temporal search within Open Data portals and (ii) interlinking Open Data portals with other datasets by the principles of Linked Data.
The following section gives an overview of the datasets and sources used to construct the base knowledge graph of temporal and geo-entities, namely the geo-data sources GeoNames, OpenStreetMap and NUTS, the knowledge bases Wikidata and DBpedia, and the periods/events dataset PeriodO.
GeoNames.org. The GeoNames database contains over 10 million geographical names of entities such as countries, cities, regions, and villages. It assigns unique identifiers to geo-entities and provides a detailed hierarchical description including countries, federal states, regions, cities, etc. For instance, the GeoNames entity for the city of Munich 5 has the parent relationship "Munich, Urban District", which is located in the region "Upper Bavaria" of the federal state "Bavaria" in the country "Germany", i.e. the GeoNames database allows us to extract the following hierarchical relation for the city of Munich:
Germany > Bavaria > Upper Bavaria > Munich, Urban District > Munich
The relations are based on the GeoNames ontology 6 which defines administrative divisions (first-order gn:A.ADM1, second-order gn:A.ADM2, until gn:A.ADM5) 7 for countries, states, cities, and city districts/sub-regions. In this work we make use of an RDF dump of the GeoNames database, 8 which consists of alternative names and hierarchical relations of all the entities.
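To illustrate how such a parent hierarchy can be traversed, the following minimal Python sketch stores the Munich example as a child-to-parent mapping and walks it up to the country level. The dictionary PARENT and the function ancestors are hypothetical names introduced only for this illustration; in the GeoNames RDF data the parent links are expressed as triples rather than as a Python dictionary.

```python
# Hypothetical child -> parent mapping for the Munich example from the text.
PARENT = {
    "Munich": "Munich, Urban District",
    "Munich, Urban District": "Upper Bavaria",
    "Upper Bavaria": "Bavaria",
    "Bavaria": "Germany",
}


def ancestors(entity, parent=PARENT):
    """Return the chain of parent regions of `entity`, nearest parent first."""
    chain = []
    seen = {entity}
    while entity in parent:
        entity = parent[entity]
        if entity in seen:  # guard against accidental cycles in the data
            break
        seen.add(entity)
        chain.append(entity)
    return chain


munich_chain = ancestors("Munich")
```

Such a chain of parent regions is the kind of information that a disambiguation step can later use as context.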
OpenStreetMap (OSM). OSM 9 was founded in 2004 as a collaborative project to create free editable geospatial data. The map data is mainly produced by volunteers using GPS devices (on foot, by bicycle, by car, etc.) and, later, by importing commercial and government sources, e.g., aerial photographs. Initially, the project focused on mapping the United Kingdom but was soon extended to a worldwide effort. OSM uses four basic "elements" to describe geo-information: 10
• Nodes in OSM are specific points defined by a latitude and longitude.
• Ways are lists of nodes that define a line. OSM ways can also define areas, i.e. a "closed" way where the last node on the way equals the first node.
• Relations define relationships between different OSM elements: They either split long ways into smaller segments (for easier processing), or build complex objects, e.g., a route is defined as a relation of multiple ways (such as highway, cycle route, bus route) along the same route.
• Tags are used to describe the meaning of any element, e.g., there could be a tag highway=residential 11 (tags are represented as key-value pairs) which is used on a way element to indicate a road within a settlement.
There are already existing works which exploit the potential of OSM to enrich and link other sources. For instance, in [5] we have extracted indicators, such as the number of hotels or libraries in a city, from OSM to collect statistical information about cities. Likewise, the software library Libpostal 12 uses addresses and places extracted from OSM: it provides street address parsing and normalization by using machine learning algorithms on top of the OSM data. The library converts free-form addresses into clean normalized forms and can therefore be used as a pre-processing step to geo-tagging of streets and addresses. We integrate Libpostal in our framework in order to detect and filter streets and city names in text and address lines.
Sources to obtain postal codes and NUTS codes. Postal codes are regional codes consisting of a series of characters (not necessarily digits) with the purpose of sorting mail. Since postal codes are country-specific identifiers, and their granularity and availability vary strongly between countries, there is no single, complete data source to retrieve these codes. The most reliable way to get the complete dataset is typically via governmental agencies (made easy, in case they publish the codes as open data). 13 Another source worth mentioning for matching postal codes is GeoNames: it provides a collection of postal codes for several countries and the respective names of the places/districts. 14 Partially, postal codes for certain countries are available in the knowledge bases of Wikidata and DBpedia (see below) for the respective entries of the geo-entities (using "postal code" properties). However, we stress that these entries are not complete, i.e., not all postal codes are available in the knowledge bases as not all respective geo-entities are present, and also, the codes' representation is not standardized.
NUTS (French: nomenclature des unités territoriales statistiques). Apart from national postal codes, another geocode standard has been developed and is being regulated by the European Union (EU). It references the statistical subdivisions of all EU member states in three hierarchical levels, NUTS 1, 2, and 3. All codes start with the two-letter ISO 3166-1 [6] country code and each level adds an additional number to the code. The current NUTS classification lists 98 regions at NUTS 1, 276 regions at NUTS 2 and 1342 regions at NUTS 3 level and is available from the EC's Webpage. 15 Also worth mentioning in this context, as an additional source for statistical and topographical maps on NUTS regions, are the basemaps developed at the European level by Eurostat, available as REST services. 16
Wikidata and DBpedia. These domain-independent, multi-lingual knowledge bases provide structured content and factual data. While DBpedia [7] is automatically generated by extracting information from Wikipedia, Wikidata [8], in contrast, is a collaboratively edited knowledge base which is intended to provide information to Wikipedia. These knowledge bases already partially include links to GeoNames, NUTS identifiers, and postal code entries, as well as temporal knowledge for events and periods, e.g., elections, news events, and historical epochs, which we also harvest to complete our knowledge graph.
PeriodO. The PeriodO project [9] offers a gazetteer of historical, art-historical, and archaeological periods. The user interface allows users to query and filter the periods by different facets. Further, the authors published the full dataset as a JSON-LD download 17 and re-use the W3C skos, time and dcterms:spatial vocabularies to describe the temporal and spatial extent of the periods. For instance, the (shortened) PeriodO entry in Figure 1 describes the period of the First World War.
11 cf. https://wiki.openstreetmap.org/wiki/Tag:highway=residential
12 https://medium.com/@albarrentine/statistical-nlp-on-openstreetmap-b9d573e6cc86, last accessed 2017-09-12
13 For instance, the complete list of Austrian postal codes is available as CSV via the Austrian Open Data

Base Knowledge Graph Construction
The previous section listed several geo-data repositories as well as datasets containing time periods and event data (some already available as Linked Data via an endpoint), which we use in the following to build up a spatio-temporal knowledge graph: Section 3.1 describes the extraction and integration of geospatial knowledge, and Section 3.2 that of temporal knowledge. The remainder of the paper uses an additional color coding of turquoise for introducing temporal and blue for geospatial properties.
Herein, we describe the composition of the graph by presenting conceptual SPARQL CONSTRUCT queries. This means that (most of) the presented queries cannot be executed as they are, because either there is no respective endpoint available or the query is not feasible and times out. While this section serves as a schematic specification of the constructed graph, we detail the actual realization of the queries in Appendix A. Still, we deem these conceptual SPARQL CONSTRUCT queries useful as a mechanism to declaratively express knowledge graph compilation from Linked Data sources, following Rospocher et al.'s definition, who describe knowledge graphs as "a knowledge-base of facts about entities typically obtained from structured repositories" [10]. 18 Figure 2 lists all the namespaces that are used in the SPARQL queries throughout the paper and in the Appendix.

Spatial Knowledge
Our knowledge graph of geo-entities is based on the GeoNames hierarchy, where we extract
• geo-entities and their labels,
• links to parent entities and particularly the containing country,
• coordinates in terms of points and (if available) geometries in terms of polygons,
• postal codes (again, if available), and
• sameAs-links to other sources such as DBpedia, OSM, or Wikidata (again, if available).
18 As a side remark, such queries could for instance be used to declaratively annotate the provenance trail of knowledge graphs compiled from other Linked Data sources, e.g. expressed through labeling the activity to extract the relevant knowledge with PROV's [11] prov:wasGeneratedBy property with a respective SPARQL CONSTRUCT query.
The respective SPARQL CONSTRUCT query 19 in Figure 3 displays how the hierarchical data can be extracted from the GeoNames datasets (loaded into a SPARQL endpoint) for a selected country ?c. The GeoNames Ontology 20 allows us to retrieve the relevant data for our knowledge graph per country, by replacing ?c in this query with a concrete country URI, such as http://sws.geonames.org/2782113/ (for Austria). The GeoNames RDF data partially already contains external links to DBpedia (using rdfs:seeAlso), which we model as equivalent identifiers using owl:sameAs. The hierarchy is constructed using the gn:parentFeature property. As GeoNames offers various properties containing names, we extract all official English and (for the moment) German names, as we will use those later on to populate our search index.
The query in Figure 4 then displays how we integrate the information in Wikidata into our spatial knowledge graph. In particular, Wikidata serves as a source to add labels and links for postal codes (gn:postalCode) and NUTS identifiers (wdt:P605) for the respective geographical entities. Further, we again add external links (to OSM and Wikidata itself) that we harvest from Wikidata as owl:sameAs relations to our graph.
The query in Figure 5 conceptually shows how and which data we extract for certain OSM entities into our knowledge graph. We note here that OSM does not provide an RDF or SPARQL interface; the idea is that we, roughly speaking, treat the data returned by OSM's Nominatim API in JSON as JSON-LD. Details and pre-processing steps are given in Appendix A.2.

Temporal Knowledge
As for temporal knowledge, we aim to compile into our knowledge graph a base set of temporal entities (that is, named periods and events from Wikidata and PeriodO), where we want to extract
• named events and their labels,
• links to parent periods that they are part of, again to create a hierarchy,
• temporal extent in terms of a single beginning and end date, and
• links to a spatial coverage of the respective event or period (if available).
We observe here that temporal knowledge is typically less consolidated than geospatial knowledge, i.e. "important" named entities in terms of periods or events are not governed by internationally agreed and nationally governed structures, such as border agreements in the case of spatial entities. Even worse, cross-cultural differences, such as different calendars or even time zones, add further confusion. We still believe that the two integrated sources, which cover events of common interest in a multilingual setting on the one hand (Wikidata), and historical periods and epochs from the literature on the other (PeriodO), provide a good starting point.
In the future, it might be useful to also index news events, or recurring periods or points in time, such as public holidays. However, we did not find any structured datasets available as linked data for that, so we have to defer these to future work or, respectively, pose the creation of such structured datasets as a challenge for the community. One obvious existing starting point here would be the work by Rospocher et al. [10] and the news events datasets they created in the EU Project NewsReader. 21 For the moment, we did not consider this work due to its fine granularity, which in our opinion is not needed in the majority of Open Data search use cases.
Again, we model the knowledge graph extraction and construction in terms of conceptual SPARQL queries: We use the query in Figure 6 to extract event information from Wikidata. Note that this query times out on the public Wikidata endpoint. Therefore, in order to extract the relevant events and time periods as described in Figure 6, we converted a local Wikidata dump to HDT [12], extracted only the relevant triples for the query, materialized the path expressions, and executed the targeted CONSTRUCT query over these extracts on a local endpoint; the full details are provided in Appendix A.3. We do not just extract existing triples from the source, but try to aggregate/flatten the representation to a handful of well-known predicates from Dublin Core (prefix dcterms:) and the OWL Time ontology (prefix time:). Likewise, we use the query in Figure 8 to extract periods from the PeriodO dataset, again using the same flattened representation. To execute this query, we could simply load the available PeriodO dump into a local RDF store.
Note that in these queries, in a slight abuse of the OWL Time ontology, we "invented" the properties timex:hasStartTime and timex:hasEndTime, which do not exist in the original OWL Time ontology. This is a compromise for the desired compactness of representation in our knowledge graph, i.e. these are mainly introduced as shortcuts to avoid the materialization of unnecessary blank nodes in the (for our purposes too) verbose notation of OWL Time. A proper representation using OWL Time could easily be reconstructed by means of the CONSTRUCT query in Figure 7. For this purpose we define our own vocabulary extension of the OWL Time ontology, for the moment under the namespace http://data.wu.ac.at/ns/timex#.
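The trade-off between the compact shortcut properties and the more verbose OWL Time notation can be sketched in a few lines of Python. The sketch below rewrites shortcut triples into a blank-node based form; it is an assumption about what such a reconstruction might look like, not a transcription of the query in Figure 7. In particular, the choice of time:hasBeginning, time:hasEnd and time:inXSDDate as target properties, the tuple-based triple representation, and the function name expand_to_owl_time are illustrative.

```python
TIMEX = "http://data.wu.ac.at/ns/timex#"
TIME = "http://www.w3.org/2006/time#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

# Assumed correspondence between the shortcut properties and OWL Time.
SHORTCUTS = {
    TIMEX + "hasStartTime": TIME + "hasBeginning",
    TIMEX + "hasEndTime": TIME + "hasEnd",
}


def expand_to_owl_time(triples):
    """Rewrite shortcut triples into OWL Time style triples with blank nodes.

    `triples` is a list of (subject, predicate, object) string tuples.
    Every shortcut triple becomes three triples; all others pass through.
    """
    out = []
    counter = 0
    for s, p, o in triples:
        if p in SHORTCUTS:
            counter += 1
            bnode = "_:b%d" % counter  # one fresh blank node per date value
            out.append((s, SHORTCUTS[p], bnode))
            out.append((bnode, RDF_TYPE, TIME + "Instant"))
            out.append((bnode, TIME + "inXSDDate", o))
        else:
            out.append((s, p, o))
    return out
```

Each date value costs three triples and one blank node in the expanded form, compared with a single triple in the shortcut form, which illustrates the compactness argument made above.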

Dataset Labelling
In this section we first describe the algorithms to add geospatial annotations (Section 4.1) and to extract temporal labels and periodicity patterns (Section 4.2), and subsequently evaluate and discuss the performance (in terms of precision and recall based on a manually evaluated sample) and limitations of our approach in Section 4.3.
In order to add spatial and temporal annotations to Open Data resources we use the CSV files and metadata from the resources' data portals as signals. The metadata descriptions and download links are provided by our Open Data Portal Watch framework [13], which monitors and archives over 260 data portals and provides APIs to retrieve their metadata descriptions in a homogenized way using the W3C DCAT vocabulary [14]. Regarding the meta-information, we look into several available metadata fields: we consider the title, description, the tags and keywords, and the publisher. For instance, the upper part of Figure 9 displays an example metadata description. It holds cues in the title and the publisher field (cf. "Veröffentlichende Stelle", i.e. publishing agency) and holds a link to the actual dataset, a CSV file (cf. lower part of Figure 9), which we download and parse.
Figure 9: Geo-information in metadata and CSVs. Example dataset from the Austrian data portal: https://www.data.gv.at/katalog/dataset/4d9787ef-e033-4c4f-8e50-65beb0730536

Geospatial labelling
The geospatial labelling algorithm uses the different types of labels in our knowledge graph to annotate the metadata and CSV files from the input data portals.

CSVs
Initially, the columns of a CSV are classified based on regular expressions for NUTS identifiers and postal codes. While the NUTS pattern is rather restrictive, 22 the postal code pattern has to be very general, potentially allowing many false positives. Basically, the pattern is designed to allow all codes in the knowledge graph, and to filter out other strings, words, and decimals. 23 The values of a potential NUTS column (based on the regular expression) are mapped to the existing NUTS identifiers. If this is possible for a sufficient share of the values (threshold set to 90%), we consider the column a NUTS identifier column and add the respective semantic labels. In the case of potential postal codes, the algorithm again tries to map the values to existing postal codes; however, we restrict the set of codes to the originating country of the dataset. This again results in a set of semantic labels, which are accepted with a 90% threshold.
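The column classification can be sketched as follows in Python. This is a minimal illustration under stated assumptions: the two regular expressions are hypothetical stand-ins (the patterns actually used are not reproduced here), and known_codes is assumed to be a dictionary from code strings to knowledge graph labels, already restricted to the dataset's origin country in the postal code case.

```python
import re

# Hypothetical patterns, for illustration only.
NUTS_PATTERN = re.compile(r"^[A-Z]{2}[0-9A-Z]{1,3}$")         # restrictive
POSTAL_PATTERN = re.compile(r"^[0-9A-Z][0-9A-Z \-]{1,9}$")    # very general


def label_column(values, pattern, known_codes, threshold=0.9):
    """Map the cells of one column to known codes.

    Returns one label (or None) per non-empty cell if at least `threshold`
    of the non-empty cells could be mapped, and None otherwise.
    """
    cleaned = [v.strip().upper() for v in values if v and v.strip()]
    if not cleaned:
        return None
    # A cell is mapped only if it matches the pattern AND is a known code.
    labels = [known_codes.get(v) if pattern.match(v) else None for v in cleaned]
    share = sum(label is not None for label in labels) / len(cleaned)
    return labels if share >= threshold else None
```

The lookup against known codes is what keeps the very general postal code pattern in check: a cell that merely looks like a code but does not exist in the knowledge graph does not count towards the threshold.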
The labelling of string columns, i.e. sets of words or texts, uses all the labels from GeoNames and OSM and is based on the following disambiguation algorithm.
Value disambiguation. The algorithm in Figure 10 shows how we disambiguate a set of string values based on their surroundings. As surroundings we consider all the values of a single column; however, in case of multiple labels in a row we use these as additional signals. E.g., consider a CSV row with the values "Austria", "Linz", and "Hauptplatz 1", i.e., a row specifying an address, which we clearly want to consider for disambiguation.
First, the function get_context(values) counts all potential parent GeoNames entities for all of the values. To disambiguate a single value we use these counts and select the GeoNames candidate with the most votes from the context values' parent regions; cf. disamb_value(value). The function get_geonames(value) returns all potential GeoNames entities for an input string. Additionally, we use the origin country of the dataset (if available) as a restriction, i.e., we only allow GeoNames labels from the matching country.
For instance, in Figure 9 the Austrian "Linz" candidate is selected over the German "Linz" because the disambiguation resulted in a higher score, based on the matching parent regions "Upper Austria" and "Austria" of the other values in the column (Steyr, Wels, Altheim, ...).
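The voting scheme described above can be sketched in Python as follows. This is a simplified illustration, not the algorithm of Figure 10 itself: the candidate and parent tables are toy data, the identifiers such as linz_at are made up, and the lookup functions are passed in as parameters so that the sketch stays self-contained.

```python
from collections import Counter

# Toy data (hypothetical identifiers), standing in for GeoNames lookups.
CANDIDATES = {
    "Linz": ["linz_at", "linz_de"],
    "Steyr": ["steyr_at"],
    "Wels": ["wels_at"],
}
PARENTS = {
    "linz_at": ["Upper Austria", "Austria"],
    "linz_de": ["Rhineland-Palatinate", "Germany"],
    "steyr_at": ["Upper Austria", "Austria"],
    "wels_at": ["Upper Austria", "Austria"],
}


def get_geonames(value):
    """Return all candidate entities for a string (toy lookup)."""
    return CANDIDATES.get(value, [])


def parents(candidate):
    """Return the parent regions of a candidate (toy lookup)."""
    return PARENTS[candidate]


def get_context(values, get_geonames, parents):
    """Count the parent regions of all candidates of all context values."""
    votes = Counter()
    for value in values:
        for candidate in get_geonames(value):
            votes.update(parents(candidate))
    return votes


def disamb_value(value, votes, get_geonames, parents, country=None):
    """Select the candidate whose parent regions collected the most votes."""
    candidates = get_geonames(value)
    if country is not None:  # optional restriction to the origin country
        candidates = [c for c in candidates if country in parents(c)]
    if not candidates:
        return None
    return max(candidates, key=lambda c: sum(votes[p] for p in parents(c)))


column = ["Linz", "Steyr", "Wels"]
votes = get_context(column, get_geonames, parents)
best = disamb_value("Linz", votes, get_geonames, parents)
```

In this toy column the Austrian candidate wins because "Upper Austria" and "Austria" each receive three votes, whereas the parents of the German candidate receive one vote each.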
If no GeoNames mapping was found, the algorithm tries to instantiate the string values with OSM labels from the knowledge graph. Again, the same disambiguation algorithm is applied, however, with the following two preprocessing steps for each input value:
1. In order to better parse addresses, we use the Libpostal library (cf. Section 2) to extract streets and place names from strings.
2. We consider the context of a CSV row, e.g., if addresses in CSVs are separated into dedicated columns for street, number, city, state, etc. To do so we filter the allowed OSM labels by candidates within any extracted regions from the metadata description or from the corresponding CSV row (if geo-labels are available).

Metadata descriptions
The CSVs' meta-information at the data portals often gives hints about the respective regions covered by the actual data. Therefore, we use this additional source and try to extract geo-entities from the titles, descriptions and publishers of the datasets:
1. As a first step, we tokenize the input fields and remove any stopwords. Also, we split any words that are separated by dashes, underscores, semicolons, etc.
2. The input is then grouped by word sequences of up to three words, i.e. all single words, groups of two words, and groups of three words, and the previously introduced algorithm for mapping a set of values to the GeoNames labels is applied (including the disambiguation step).
Figure 9 gives an example dataset description found on the Austrian data portal data.gv.at. The labelling algorithm extracts the geo-entity "Upper Austria" (an Austrian state) from the title and the publisher "Oberösterreich". The extracted geo-entities are added as additional semantic information to the indexed resource.
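The two preprocessing steps for metadata fields can be sketched in Python as follows. The stopword list and the set of separator characters are small illustrative assumptions; a real setup would use language-specific stopword lists. The resulting word groups would then be handed to the label mapping and disambiguation step.

```python
import re

# Tiny illustrative stopword list (assumption); real lists are per language.
STOPWORDS = {"the", "of", "in", "and", "for", "der", "die", "das", "und"}


def tokenize(text):
    """Split on whitespace and common separators, then drop stopwords."""
    tokens = re.split(r"[\s\-_;,.:/()]+", text)
    return [t for t in tokens if t and t.lower() not in STOPWORDS]


def word_groups(tokens, max_len=3):
    """Return all sequences of 1 to `max_len` consecutive tokens."""
    groups = []
    for n in range(1, max_len + 1):
        for i in range(len(tokens) - n + 1):
            groups.append(" ".join(tokens[i:i + n]))
    return groups


title_tokens = tokenize("Population of Upper-Austria_2016")
title_groups = word_groups(title_tokens)
```

Grouping up to three words is what allows multi-word labels such as "Upper Austria" to be matched even though the tokenizer splits them into single words first.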

Temporal labelling
Similarly to the geospatial cues, temporal information in Open Data comes in various forms and granularity, e.g., as datetime/timespan information in the metadata indicating the validity of a dataset, or year/month/time information in CSV columns providing timestamps for data points or measurements.

CSVs
To extract potential datetime values from the datasets we parse the columns of the CSVs using the Python dateutil library. 24 This library is able to parse a variety of commonly used date-time patterns (e.g., "January 1, 2047", "2012-01-19", etc.); however, we only allow values where the parsed year is between 1900 and 2050. 25 For both sources of temporal information, i.e. metadata and CSV columns, we store the minimum and maximum (or start and end time) value so that we can allow range queries over the annotated data.
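The year restriction and the min/max extraction can be illustrated with a short Python sketch. To keep it free of dependencies, the sketch does not call dateutil; instead it tries a handful of explicit strptime patterns as a stand-in, which is an assumption made purely for illustration and covers far fewer formats. The names FORMATS, parse_date and column_range are hypothetical.

```python
from datetime import datetime

# Stand-in for a general date parser: a few explicit patterns (assumption).
FORMATS = ("%Y-%m-%d", "%B %d, %Y", "%d.%m.%Y", "%Y")


def parse_date(value, min_year=1900, max_year=2050):
    """Parse a cell value; return None if unparsable or outside the year range."""
    text = value.strip()
    for fmt in FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        # Reject implausible years, e.g. postal codes or counts read as years.
        if min_year <= parsed.year <= max_year:
            return parsed
        return None
    return None


def column_range(values):
    """Return (earliest, latest) parsed date of a column, or None."""
    dates = [d for d in map(parse_date, values) if d is not None]
    return (min(dates), max(dates)) if dates else None
```

Storing only the earliest and latest value per column is sufficient to answer range queries of the form "datasets covering a given interval".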
Datetime periodicity patterns. The algorithm in Figure 11 displays how we estimate any pattern of periodicity of the values in a column for a set of input datetime values. Initially, we check if all the values are the same (denoted as static column), e.g., a column where all cells hold "2018". Then we sort the values; however, note that this step could lead to unexpected annotations, because the underlying pattern might not appear in the unsorted column.
We compute all differences (deltas) between the sorted input dates and check whether all these deltas have approximately (with a 10% margin) the same length. We distinguish daily, weekly, monthly, quarterly, and yearly patterns; in case of any other recurring pattern we return other.
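A minimal Python sketch of such a periodicity estimate is given below. It is not the algorithm of Figure 11 itself: the reference lengths in days, the comparison of each delta against the mean delta, and the return value None for columns without a regular pattern are assumptions made for this illustration.

```python
from datetime import datetime

# Assumed reference lengths in days for the named patterns.
PATTERNS = {
    "daily": 1.0,
    "weekly": 7.0,
    "monthly": 30.4375,
    "quarterly": 91.3125,
    "yearly": 365.25,
}


def periodicity(dates, margin=0.1):
    """Estimate the periodicity of a list of datetime values."""
    if not dates:
        return None
    if len(set(dates)) == 1:
        return "static"  # all cells hold the same value
    ordered = sorted(dates)
    deltas = [(b - a).total_seconds() / 86400.0
              for a, b in zip(ordered, ordered[1:])]
    mean = sum(deltas) / len(deltas)
    # All deltas must be approximately equal (within the margin).
    if any(abs(d - mean) > margin * mean for d in deltas):
        return None
    for name, days in PATTERNS.items():
        if abs(mean - days) <= margin * days:
            return name
    return "other"  # regular, but none of the named patterns
```

The margin is what lets months of 28 to 31 days or leap years still count as one regular pattern.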

Metadata descriptions
We extract the datasets' temporal contexts from the metadata descriptions available at the data portals in two forms: (i) We extract the published and last modified information in case the portal provides dedicated metadata fields for these. (ii) We use the resource title, the resource description, the dataset title, the dataset description, and the keywords as further sources for temporal annotations. However, we prioritize the sources in the above order, meaning that we use the temporal information in the resource metadata rather than the information in the dataset title or description. 26
24 https://dateutil.readthedocs.io/en/stable/
25 The main reason for this restriction is that allowing any input year easily leads to wrong mappings of, e.g., postal codes, counts, etc.
26 For instance, consider a dataset titled "census data from 2000 to 2010" that holds several CSVs titled "census data
The datetime extraction from titles and descriptions is based on the Heideltime framework [15], since this information typically comes as natural text. Heideltime supports extraction and normalization of temporal expressions for ten different languages. In case the data portal's origin language is not supported we use English as a fallback option.
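The prioritization of metadata sources described above can be sketched as a simple first-match selection in Python. The field names and the shape of the input are hypothetical: the sketch assumes that a temporal tagger has already been run on each field and that its results are available as a dictionary from field name to a list of extracted annotations; the tagger itself is not called here.

```python
# Field names are hypothetical; the order follows the prioritization in the text.
PRIORITY = (
    "resource_title",
    "resource_description",
    "dataset_title",
    "dataset_description",
    "keywords",
)


def select_temporal_annotation(annotations):
    """Return (field, annotations) for the highest-priority field with a result.

    `annotations` maps a metadata field name to the list of temporal
    annotations extracted from that field (possibly empty or missing).
    """
    for field in PRIORITY:
        found = annotations.get(field)
        if found:
            return field, found
    return None
```

With this ordering, a resource-level title that names a specific year takes precedence over a broader range mentioned in the dataset-level title.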

Indexed Datasets & Evaluation
Our framework currently contains CSV tables from 11 European data portals from 10 different countries, cf. Table 1. We manually selected European governmental data portals (potentially also using NUTS identifiers in their datasets) which are already monitored by the Open Data Portal Watch [13]. Note that the notion of datasets on these data portals (wrt. Table 1) usually groups a set of resources; for instance, a dataset typically groups resources which provide the same content in different file formats. A detailed description and analysis of Open Data portals' resources can be found in [13]. The statistics in Table 1
Here we focus on evaluating the annotated geo-entities, and neglect the temporal annotations for the following two main reasons: First, the datetime detection over the CSV columns is based on the standard Python library dateutil. The library parses standard datetime formats (patterns such as yyyy-mm-dd, or yyyy), and the potential errors here are that we incorrectly classify a numerical column, e.g., classifying postal codes as years. As a very basic pre-processing step, for which we do not see a need for evaluation, we reduce the allowed values to the range 1900-2050 (with the drawback of potential false negatives); however, using the distribution of the numeric input values [16] would allow a more informed decision. Second, the labelling of metadata information is based on the temporal tagger Heideltime [15], which provides promising evaluations over several corpora.
Manual inspection of a sample set. To show the performance and limitations of our labelling approach we randomly selected 10 datasets per portal (using Elasticsearch's built-in random function 27 ) and from these again randomly selected 10 rows, which resulted in a total of 101 inspected CSVs, 28 i.e. 1010 rows (with up to several dozen columns per CSV). Sampling datasets from different portals allows us to assess and compare the performance for different countries and data publishing strategies. The median percentage of annotated records (i.e. rows) per dataset (across all indexed datasets) is 92%; our sample is representative in this respect, with a median of 88% annotated rows. The median number of total rows of all indexed datasets is lower (86 rows) than within the evaluated sample (287 rows). However, the overall number of rows varies widely, with a mean of 1742 rows across all datasets, which indicates a large variety and uneven distribution of dataset sizes (between 1 and 221k rows).
As for the main findings, let us provide a short summary in the following; all selected datasets and their assigned labels can be found at https://github.com/sebneu/geolabelling/tree/eu-data/jws_evaluation. First, we have to state that this evaluation was done manually by the authors and is therefore restricted by our knowledge of the data portals' origin countries and their respective languages, regions, sub-regions, postal codes, etc. For instance, we were able to see that our algorithm correctly labelled the Greek postal codes in some of the test samples from the Greek data portal data.gov.gr, 29 but that we could not assign the corresponding regions and streets. 30 However, as we are not able to read and understand the Greek language (and the same holds for the other non-English/German/Spanish portals), we cannot rule out mismatches or missing annotations that we did not spot during our manual inspection.

We categorize the datasets' labels by assessing the following dimensions: are there any correctly assigned labels in the dataset (c), are there any missing annotations (m), and did the algorithm assign incorrect links to GeoNames (g) or OSM (o); a result overview is given in Table 3.

27 https://www.elastic.co/guide/en/elasticsearch/guide/current/random-scoring.html, last accessed 2018-04-01
28 We only selected CSVs with assigned geo-entities, in order to provide a meaningful precision measure, which resulted in fewer than 10 files for the smaller data portals, e.g., opingogn.is, and therefore in 101 files in total.
29 E.g., https://github.com/sebneu/geolabelling/blob/eu-data/jws_evaluation/data_gov_gr/0.csv; the datasets use "T.K." in the headers to indicate these codes.
Out of 101 inspected datasets, we identified correct annotations in 87 CSVs. In particular, for the Spanish and the Greek data portals only 50% of the test samples contained correct links, while for the 9 other indexed data portals the rate is close to 100%. Regarding missing annotations, we identified 53 datasets for which our algorithm (and also the completeness of our spatial knowledge graph) needs improvement. For instance, in some datasets from the Netherlands' data portal 31 and also the Slovakian portal 32 we identified street names and addresses that could potentially be mapped to OSM entries.
Regarding incorrect links, there were only 12 files with wrong GeoNames and 5 files with wrong OSM annotations. An exemplary error that we observed here was that some files 33 contain a column with the value "Norwegen" ("Norway"): since the file is provided at a German data portal, we incorrectly labelled the column using a small German region Norwegen instead of the country, because our algorithm prefers labels from the origin country of the dataset. Another example that we want to address in future versions of our labelling algorithm is the following wrong assignment of postal codes: 34 we incorrectly annotated a numeric column with the provinces of Spain (which use two-digit numbers as postal codes).

30 The Greek data portal uses Greek letters in its metadata and CSVs, which would require a specialized label mapping wrt. lower-case mappings, stemming, etc.
31 E.g., https://github.com/sebneu/geolabelling/tree/eu-data/jws_evaluation/data_overheid_nl/4.csv
32 E.g., https://github.com/sebneu/geolabelling/tree/eu-data/jws_evaluation/data_gov_sk/3.csv
33 https://github.com/sebneu/geolabelling/blob/eu-data/jws_evaluation/offenedaten_de/0.csv

Table 4 displays the precision, recall, and F1 score for all sample records, i.e., for all annotated cells of the 101 sample CSVs. We want to emphasize that these results do not say anything about the quality of the data portals themselves. As mentioned in the paragraph above, we can again see in Table 4 that the Greek (data.gov.gr) and the Spanish data portal (datos.gob.es) have a notable drop in precision, 35 while for the other portals the total precision is still at 86%. The total recall is at 73%, which again shows that our approach needs further improvement in terms of missed annotations and completeness of the spatial knowledge graph.
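For reference, the three measures reported in Table 4 can be computed from cell counts as in the following sketch. The mapping of the inspection outcomes to counts (correct labels as true positives, wrong labels as false positives, missing labels as false negatives) is an assumption made for illustration, and the numbers in the usage line are toy values, not results from the evaluation.

```python
def precision_recall_f1(correct, incorrect, missing):
    """Compute precision, recall and F1 from counts of annotated cells.

    correct   -- cells whose assigned label is right (true positives)
    incorrect -- cells whose assigned label is wrong (false positives)
    missing   -- cells that should carry a label but have none (false negatives)
    """
    assigned = correct + incorrect   # everything the algorithm labelled
    expected = correct + missing     # everything that should be labelled
    precision = correct / assigned if assigned else 0.0
    recall = correct / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy counts for illustration only.
p, r, f1 = precision_recall_f1(correct=8, incorrect=2, missing=2)
print(round(p, 2), round(r, 2), round(f1, 2))
```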

Export RDF
We make our knowledge graph and the RDFized linked data points from the CSVs available via a SPARQL endpoint. Figure 12 displays an example extract of the RDF export of the knowledge graph. The sources of the aggregated links between the different entities and literals in our graph are indicated in the figure; we re-use the GeoNames ontology (gn:) for the hierarchical enrichments generated by our algorithms (see links [m]), and owl:sameAs for mappings to OSM relations and NUTS regions, cf. Figure 12.

34 https://github.com/sebneu/geolabelling/blob/eu-data/jws_evaluation/datos_gob_es/7.csv
35 There are streets in OSM that are labelled by an identifier (e.g. "2810 254 527") and, coincidentally, match columns in Greek datasets. Regarding the Spanish datasets, we incorrectly matched several columns containing the numbers 1-50: we mapped these to the fifty provinces of Spain, which use the numbers 1-50 as ID/zip codes. In future work we plan to include simple rules and heuristics to avoid such simple errors.
Annotated data points. We export the linked data points from the CSVs in two ways. First, for any linked geo-entity <g> in our knowledge graph, we add triples for data points uniquely linked in CSV resources (that is, values appearing in particular columns) using the following triple schema: if the entity <g> appears in a column in the given CSV dataset, i.e., the value VALUE in that column has been labelled with <g>, we add a triple of the form
<g> <u#col> "VALUE" .
That is, we mint URIs for each column col appearing in a CSV accessible through a URL u by the schema u#col, i.e., using fragment identifiers. The column's name col is either the column header (if a header is available and the result is a valid URI) or a generic header using the column's index: column1, column2, etc. These triples are coarse-grained, i.e., they do not refer to a specific row in the data. We chose this representation to enable easy-to-write, concise SPARQL queries like for instance:

Second, we also expose a finer-grained representation that provides the exact table cells to which a certain geospatial entity is linked, using an extension of the CSVW vocabulary [17]. Note that the CSVW vocabulary itself does not provide means to conveniently annotate table cells in column col and row n, which is what we need here, so we define our own vocabulary extension for this purpose (for the moment, under the namespace http://data.wu.ac.at/ns/csvwx#).
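The coarse-grained schema <g> <u#col> "VALUE" can be sketched in a few lines of Python. The helper names, the example URLs, and in particular the check for what counts as a header that yields a valid URI (here: only unreserved URI characters) are assumptions made for this illustration.

```python
import re


def column_name(header, index):
    """Use the header if it is safe as a URI fragment, else columnN (1-based)."""
    # Assumed validity check: only unreserved URI characters are accepted.
    if header and re.fullmatch(r"[A-Za-z0-9._~-]+", header):
        return header
    return f"column{index}"


def coarse_triples(csv_url, headers, annotations):
    """Build N-Triples lines of the form <g> <u#col> "VALUE" .

    annotations -- iterable of (column_index, cell_value, entity_uri),
                   with 1-based column indices.
    """
    triples = set()  # row-independent, so repeated values collapse
    for col_idx, value, entity in annotations:
        header = headers[col_idx - 1] if headers else None
        col = column_name(header, col_idx)
        literal = value.replace("\\", "\\\\").replace('"', '\\"')
        triples.add(f'<{entity}> <{csv_url}#{col}> "{literal}" .')
    return sorted(triples)


# Hypothetical CSV URL and entity URI, for illustration only.
for line in coarse_triples(
    "http://example.org/data.csv",
    ["city", "population 2015"],
    [(1, "Linz", "http://example.org/geo/linz"),
     (1, "Linz", "http://example.org/geo/linz")],
):
    print(line)
```

Because the triples carry no row information, the two identical annotations in the usage example yield a single triple.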
We use the CSVW class csvw:Cell for an annotated cell and add the row number and value (using csvw:rownum and rdf:value) to it. We extend CSVW by the property csvwx:cell to refer from a csvw:Column (again using the fragment identifier u#col) to a specific cell, and by the property csvwx:rowURL to refer to the CSV's row (using the schema u#row=n). Here, the property csvwx:refersToEntity (cf. the blue line) connects a specific cell to the labelled geo-entity <g>. Analogously, in case of available (labelled) temporal information for a cell, we use the property csvwx:hasTime; cf. the turquoise line in the following example:

Moreover, we denote the geospatial scope of the column itself by declaring the entity within whose geographic scope all entities appearing in the column lie. The idea here is that we annotate, on column level, the least common ancestor of the spatial entities recognized in cells within this column. E.g.,
<u#col> csvwx:refersToEntitiesWithin <g1> .
with the semantics that the entities linked to col in the CSV u all refer to entities within the entity g1 (such that g1 is the least common ancestor in our knowledge graph). This could be seen as a shortcut/materialization for a CONSTRUCT query as in Figure 13. Obviously, this query is very inefficient, and we rather compute these least common ancestors per column during labelling/indexing of each column.

CSV on the Web. In order to complete the descriptions of our annotations in our RDF export, we describe all CSV resources gathered from the annotated Open Data portals and their columns using the CSV on the Web (CSVW) [17] vocabulary, re-using the following parts of the CSVW schema. Firstly, we use the following scheme to connect our aforementioned annotations to the datasets:

Then, we enrich this skeleton with further CSVW annotations that we can extract automatically from the respective CSV files. Figure 14 displays an example export for a CSV resource. The blank node :csv represents the CSV resource, which can be downloaded at the URL given by the property csvw:url. The values of the properties dcat:byteSize and dcat:mediaType are the values of the corresponding HTTP header fields. The dialect description of the CSV can be found via the blank node :dialect at property csvw:dialect, and the columns of the CSV are connected to the :schema blank node (describing the csvw:tableSchema of the CSV).
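Returning to the column-level annotation csvwx:refersToEntitiesWithin: the per-column least common ancestor can be computed directly on the hierarchy instead of via the CONSTRUCT query. The following sketch assumes that the hierarchy is available as a simple child-to-parent mapping (as induced by gn:parentFeature) and that an entity counts as its own ancestor; the function names and the toy hierarchy are illustrative only.

```python
def ancestors(entity, parent):
    """Return the path from entity up to the root, starting with entity itself."""
    path, seen = [entity], {entity}
    while entity in parent:
        entity = parent[entity]
        if entity in seen:  # guard against cycles in the input data
            break
        path.append(entity)
        seen.add(entity)
    return path


def least_common_ancestor(entities, parent):
    """Deepest entity that contains all given entities, or None if there is none."""
    entities = list(entities)
    if not entities:
        return None
    common = ancestors(entities[0], parent)  # ordered from deepest to root
    for entity in entities[1:]:
        others = set(ancestors(entity, parent))
        common = [a for a in common if a in others]  # keeps deepest-first order
    return common[0] if common else None


# Toy hierarchy (child -> parent), for illustration only.
PARENT = {
    "Linz": "Upper Austria",
    "Wels": "Upper Austria",
    "Upper Austria": "Austria",
    "Vienna": "Austria",
    "Austria": "Europe",
}

print(least_common_ancestor(["Linz", "Wels"], PARENT))    # Upper Austria
print(least_common_ancestor(["Linz", "Vienna"], PARENT))  # Austria
```

Each path is walked once per cell entity, so the cost per column is proportional to the number of distinct entities times the depth of the hierarchy.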

Search & Query Interface
Our integrated prototype for a spatio-temporal search and query system for Open Data currently consists of three main parts: first, the geo-entities DB and search engine in the back end (Section 6.1); second, the user interface and APIs (Section 6.2); and third, access to the above-described RDF exports via a SPARQL endpoint (Section 6.3).

Back End
All labels from all the integrated datasets and their corresponding geo-entities are stored in a look-up store, for which we use the NoSQL document database MongoDB. It allows an easy integration of heterogeneous data sources and very performant look-ups of keys (in our case, e.g., labels, GeoNames IDs, postal codes, etc.).
Further, we use Elasticsearch to store and index the processed CSVs and their metadata descriptions. In our setup, an Elasticsearch document corresponds to an indexed CSV and consists of all cell values of the table (arranged by columns), the potential geo-labels for a labelled column, metadata of the CSV (e.g., the data portal, title, publisher, etc.), the temporal annotations, and any additional labels extracted from the metadata.
The different components all have an impact on the performance and efficiency of the system. The indexing performance depends on the MongoDB database for label look-ups, the time tagger Heideltime, and Elasticsearch for storing the datasets. To show the efficiency and scalability of our approach, we timed the indexing of a sample of 2160 datasets with an average file size of ∼50kB. The total processing time for all datasets was 16.8 hours (with parallelization deactivated, including download, parsing, and processing time), whereof 8 hours were consumed by the labelling algorithms. This corresponds to a mean of 28 seconds per dataset. Notably, the median total time for indexing a dataset is only 1.2 seconds, with a median time of 0.7 seconds for the labelling algorithms, 36 which suggests that a small number of large files dominates the total.

User interface
The user interface, available at http://data.wu.ac.at/odgraphsearch/, allows search queries for geo-entities as well as full-text matches. Note that the current UI implements geo-entity search using auto-completion of the input (only suggesting entries with existing datasets) and supports full-text querying by pressing the "Enter" key in the input form. The screenshot in Figure 15 displays an example query for the Austrian city "Linz". The green highlighted cells in the rows below show the annotated labels, for instance, the annotated NUTS2 code "AT31" in the second result in Figure 15.
In addition, we provide facets to filter datasets relevant to a particular period, either by selecting a period/event label via auto-completion in a separate field, or by choosing start and end dates via sliders (cf. Figure 15). Users can decide whether to apply this filter to the temporal information in the title and description of the dataset or to the CSV columns.
By separating the search at these two levels we do not mix dates from the data level and the metadata level. For instance, the metadata could contain dates/times that refer to the present, such as created, modified, etc., while in the datasets there can be a mixture of dates referring to temporal information or events (e.g., a column of birth dates).
The chosen geo-entities and durations, which are fixed via these look-ups in our search index through the UI, are passed on as parameters, i.e., as a concrete geo-ID and/or start and end date, to our API, which we describe next. Additionally, the web interface provides APIs (http://data.wu.ac.at/odgraphsearch/api) to retrieve the search results, all indexed datasets, and the RDF export per dataset. To allow programmatic access to the search UI we offer the following HTTP GET API:

/api/v1/get/datasets?l={GeoIDs}&limit={limit}&offset={offset}&start={startDate}&end={endDate}&mstart={startDate}&mend={endDate}&periodicity={dateTimePattern}&q={keyword}

The API takes multiple instances of geo IDs, that is, GeoNames or OSM IDs (formatted as osm:{ID}), using the parameter l; a limit and an offset parameter, which restrict the number of items (datasets) returned; and an optional whitespace-separated list of keywords (q) as full-text query parameter to enable conventional keyword search in the metadata and header information of the datasets. To restrict the results to a specific temporal range we implemented the parameters mstart and mend (for filtering a time range from the metadata information), and start and end (for the min and max values of date annotations from CSV columns). The parameter periodicity allows filtering for datetime periodicity patterns such as "yearly", "monthly", or "static" (in case there is only a single datetime value in this column), cf. Section 4.2.1 for a detailed description of the periodicity patterns.
The output consists of a JSON list of documents that contain the requested GeoNames/OSM IDs or any tables matching the input keywords.
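As an illustration of how a client might assemble such a request, the following sketch builds the query string with the Python standard library. The parameter names are those listed above; the full base URL (the path appended to the host of the web interface), the ISO date format, the default values, and the placeholder IDs are assumptions and should be checked against the running service. The sketch only constructs the URL and does not send a request.

```python
from urllib.parse import urlencode

# Assumed base URL: API path from the text appended to the interface host.
BASE = "http://data.wu.ac.at/odgraphsearch/api/v1/get/datasets"


def dataset_search_url(geo_ids=(), keywords=(), limit=10, offset=0,
                       start=None, end=None, mstart=None, mend=None,
                       periodicity=None, base=BASE):
    """Build the GET URL for the dataset search API."""
    # The parameter l may be repeated, once per GeoNames or OSM ID.
    params = [("l", geo_id) for geo_id in geo_ids]
    params += [("limit", limit), ("offset", offset)]
    optional = {"start": start, "end": end, "mstart": mstart,
                "mend": mend, "periodicity": periodicity}
    params += [(key, val) for key, val in optional.items() if val is not None]
    if keywords:
        # Keywords are passed as one whitespace-separated value of q.
        params.append(("q", " ".join(keywords)))
    return base + "?" + urlencode(params)


# Placeholder IDs and dates, for illustration only.
print(dataset_search_url(geo_ids=["1234567", "osm:7654321"],
                         keywords=["population", "vienna"],
                         start="2015-01-01", end="2016-12-31"))
```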

SPARQL endpoint
We offer a SPARQL endpoint at http://data.wu.ac.at/odgraphsearch/sparql, where we provide the data as described in Section 5. As of the first week of April 2018, the endpoint contains 88 million triples, including 15 million hierarchical relations using the gn:parentFeature property and the descriptions of 11768 CSVs (together with their CSV on the Web descriptions), to which we added a total of 5 million geo-annotations using the csvwx:refersToEntity property and 1.3 million datetime annotations using the csvwx:hasTime property.
Example queries. The first example in Figure 16 lists all datasets from Vienna, using the csvwx:refersToEntity metadata annotation, and only lists CSVs where there exists a column with dates within the range of the last Austrian legislative period, using the Wikidata entities of the past two elections.
Figure 16: Example SPARQL query using the spatial property csvwx:refersToEntity and the temporal properties timex:hasStartTime and timex:hasEndTime.
The next example query in Figure 17 combines text search for time periods with a structured query for relevant data; it looks for datasets about a time period before the Second World War, called the "Anschluss movement" (i.e., the preparation of the annexation of Austria into Nazi Germany), and queries for all available CSV rows in which a date within this period's range (1918-1938) and a geo-entity within the period's spatial coverage location (i.e., Austria) occur.
Figure 17: Example SPARQL query combining text search for a time period with a structured query for datasets within the period's temporal and spatial coverage.
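As a simpler illustration of how the cell-level vocabulary from Section 5 can be queried, the following Python sketch assembles a SPARQL query that lists the columns, row numbers, and values of all cells referring to a given entity, and wraps it into a GET URL for the endpoint. It is not the query of Figure 16 or Figure 17: the query shape is an assumption derived from the property descriptions above (csvwx:cell, csvwx:refersToEntity, csvw:rownum, rdf:value), the function names and the entity URI are hypothetical, and the sketch does not contact the endpoint.

```python
from urllib.parse import urlencode

ENDPOINT = "http://data.wu.ac.at/odgraphsearch/sparql"

PREFIXES = """PREFIX csvw: <http://www.w3.org/ns/csvw#>
PREFIX csvwx: <http://data.wu.ac.at/ns/csvwx#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
"""


def cells_for_entity_query(entity_uri, limit=100):
    """Return a SPARQL query for all annotated cells referring to entity_uri."""
    # Reject characters that are not allowed inside a SPARQL IRI reference.
    if any(ch in entity_uri for ch in '<>"{}|\\^` '):
        raise ValueError(f"not a usable IRI: {entity_uri!r}")
    return PREFIXES + (
        "SELECT ?column ?row ?value WHERE {\n"
        "  ?column csvwx:cell ?cell .\n"
        f"  ?cell csvwx:refersToEntity <{entity_uri}> ;\n"
        "        csvw:rownum ?row ;\n"
        "        rdf:value ?value .\n"
        f"}} LIMIT {int(limit)}\n"
    )


def query_url(query, endpoint=ENDPOINT):
    """Encode the query as a SPARQL protocol GET request URL."""
    return endpoint + "?" + urlencode({"query": query})


# Hypothetical entity URI, for illustration only.
query = cells_for_entity_query("http://example.org/geo/vienna", limit=5)
print(query)
print(query_url(query))
```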
GeoSPARQL. GeoSPARQL [18] extends SPARQL to a geographic query language for RDF data. It defines a small ontology to represent geometries (i.e., points, polygons, etc.) and connections between spatial regions (e.g., contains, part-of, intersects), as well as a set of SPARQL functions to test such relationships. The example query in Figure 18 (namespaces as in Figure 2) uses the available polygon of the Viennese district "Leopoldstadt" to filter all annotated data points within the borders of this district.
While we do not yet offer a full GeoSPARQL endpoint for our prototype (which we leave to a forthcoming release), our RDFized datasets and knowledge graph are GeoSPARQL "ready", i.e., all the geo-coordinates and polygons are available in the endpoint using the GeoSPARQL vocabulary; an external GeoSPARQL endpoint could already access our data using the SERVICE keyword and evaluate the GeoSPARQL-specific functions locally, or simply import our data.

# filter all annotated data points
# within the polygon of Leopoldstadt
FILTER(geof:sfWithin(?g, ?polygon)) }
Figure 18: Example GeoSPARQL query using the available geometries (not yet available via the endpoint).
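To give an idea of what a containment test such as geof:sfWithin decides for a single point and a polygon when it is evaluated locally, the following sketch uses the standard ray-casting technique. It is a simplification and not a GeoSPARQL implementation: coordinates are treated as planar, the polygon is a single ring without holes, and points exactly on the boundary are not handled specially. The function name and the example ring are illustrative.

```python
def point_in_polygon(lon, lat, ring):
    """Ray casting: count how often a ray to the east crosses the ring.

    ring -- list of (lon, lat) vertices of a simple polygon without holes.
    """
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]  # wrap around to close the ring
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that horizontal line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lon < x_cross:
                inside = not inside  # each crossing toggles the state
    return inside


# Toy square, for illustration only.
SQUARE = [(0.0, 0.0), (4.0, 0.0), (4.0, 4.0), (0.0, 4.0)]
print(point_in_polygon(2.0, 2.0, SQUARE))  # True
print(point_in_polygon(5.0, 2.0, SQUARE))  # False
```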

Related Work
The European Union identified the issue of insufficient description of public datasets and conducted several activities towards metadata standards across European portals: the DCAT Application Profile for Data Portals in Europe (DCAT-AP) 37 aims at the integration of datasets from different European data portals. In its current version (v1.1) it extends the existing DCAT schema [14] by a set of additional properties, e.g., it allows specifying the version and the period of time for a dataset. Going one step further, the INSPIRE directive 38 and the GeoDCAT-AP specification 39 have more restrictive requirements for spatial metadata, i.e., they model spatial coverage either as a bounding box or using a geographic identifier; notably, the specification also mentions GeoNames as potential identifiers. The main barrier with these approaches is a lack of adoption: we could not see a broad use of these standards across the portals (neither in terms of vocabulary nor in complete spatial descriptions) and therefore could not further use them. In principle, our approach differs from these activities by not only providing the spatio-temporal descriptions but also interlinking the datasets to external sources, i.e., to GeoNames, Wikidata, and OSM. Also, these standards only allow descriptions on dataset level, whereas we annotate the data on record level as well.

37 https://joinup.ec.europa.eu/release/dcat-ap-v11
38 https://inspire.ec.europa.eu/
39 https://joinup.ec.europa.eu/release/geodcat-ap/v101
The 2013 study by Janowicz et al. [19] gives an overview of Semantic Web approaches and technologies in the geospatial domain. Among the Linked Data repositories and ontologies listed in the article we also find the GeoNames ontology (cf. Section 2), the W3C Geospatial Ontologies [20], and the GeoSPARQL Schemas [18]. However, when looking into the paper's listed repositories, most of them (6/7) were not available, i.e. offline, which seems to indicate that many efforts around Geo-Linked data have unfortunately not been pursued in a sustainable manner.
The 2012 project LinkedGeoData [21] resulted in a Linked Data resource, generated by converting a subset of OpenStreetMap data to RDF and deriving a lightweight ontology from it. In [22] the authors describe their attempts to further connect GeoNames and LinkedGeoData, using string similarity measures and geometry matching. LinkedGeoData is also listed in [19] as a geospatial Linked Data repository. Their work is complementary to ours: they also perform an interlinking with DBpedia, GeoNames, and a mapping from OpenStreetMap, but do not integrate general Open Data resources. That is, their mappings are driven by generic entity linkage between these existing data sources, whereas we create a bespoke new knowledge graph out of the existing spatial and temporal linked data sources for our use case. The recent effort "Sophox" 40 can be seen as a conceptual continuation of the LinkedGeoData project: actually intended as a cleanup tool, it allows SPARQL queries over OSM elements and tags. In the future we could also consider directly using the SPARQL interface of Sophox.
The GeoKnow project [23] is another attempt to provide and manage geospatial data as Linked Data. GeoKnow provides a huge toolset to process these datasets, including the storage, authoring, interlinking, and geospatially-enabled query optimization techniques.
The project PlanetData (2010 to 2014), funded by the European Commission, released an RDF mapping of the NUTS classifications 41 [24] using the GeoVocab vocabulary. 42 This dataset models the hierarchical relations of the regions, provides labels and polygons. Unfortunately, the project does not include external links to GeoNames, or Wikidata, except for the country level, i.e. there are only 28 links to the corresponding GeoNames entries of the EU member states.
Complementary to our approach to access street addresses via OSM, Open Addresses 43 is a global collection of address data sources, which could be considered for future work as an additional dataset to feed into our knowledge graph. The manually collected and homogenized dataset consists of a total of 478M addresses; street names, house numbers, and post codes combined with geographic coordinates, harvested from governmental datasets of the respective countries.
A conceptually related approach, although not focusing on geo-data, is the work by Taheriyan et al. [25], who learn the semantic description of a new source given a set of known semantic descriptions as the training set and the domain ontology as the background knowledge.
In [26] Paulheim provides a comprehensive survey of refinement methods, i.e., methods that try to infer and add missing data to a graph; however, these approaches work on graphs in a domain-independent setting and do not focus on temporal and spatial knowledge. Still, in some sense, we view our methodology of systematic knowledge graph aggregation from Linked Data sources via declarative, use-case specific, minimal mappings as potentially complementary to the domain-independent methods mentioned therein, i.e., we think that in future work such methods should be explored in combination.
Most related wrt. the construction of the temporal knowledge graph is the work by Gottschalk and Demidova [27] (2018): they present a temporal knowledge graph that integrates and harmonizes event-centric and temporal information regarding historical and contemporary events. In contrast to [27] we additionally integrate data from PeriodO [9] and focus on periods in a geospatial context. Their work builds upon [28], where the authors extract event information from the Wikipedia Current Events Portal (WCEP). In future work we want to connect the resource from [27], since the additional data extracted from the WCEP and WikiTimes interface is particularly interesting for our framework. Similar to [27], the authors of [29] gather temporal information from knowledge bases, and additionally from the Web of documents. The extracted facts are then mapped and merged into time intervals.
In [10], Rospocher et al. build a knowledge graph directly from news articles, and in [30] a knowledge graph is built by extracting event-centric data from Wikipedia articles. These approaches work over plain text (with the potential drawback of noisy data), while we integrate existing structured sources of temporal information; therefore these are complementary/groundwork to our contributions.
Modelling and querying geospatial information has also been discussed conceptually in the literature: [31] present an ontology design pattern derived from time geography, and [32] discuss the requirements of a geospatial search platform and present a geospatial registry.

Conclusions
Governmental data portals such as Austria's data.gv.at or the UK's data.gov.uk release local, regional and national data to a variety of users (citizens, businesses, academics, civil servants, etc.). As this data is mainly collected as part of census collections, infrastructure assessments or any other, secondary output data, these resources almost always contain or refer to some kind of geographic and temporal information; for instance, think of public transport data, results of past elections, demographic indicators, etc. Search across these dimensions seems therefore natural, i.e., we have identified the spatial and temporal dimensions as the crucial, characterizing dimensions of datasets on such data portals.
In order to enable such search and to integrate these datasets into the LOD cloud (as they are mainly published as CSVs [13]), we have achieved the following tasks in this work:
• We have described a hierarchical knowledge graph of spatial and temporal entities in terms of SPARQL queries, as well as the integration of temporal information and its interlinkage with the geospatial knowledge from various Linked Data sources (GeoNames, OSM, Wikidata, PeriodO), where our general approach is extensible to adding new sources; further details of the construction are provided in the Appendix.
• We have described algorithms to annotate CSV tables and their respective metadata descriptions from Open Data Portals and we have annotated datasets and metadata from 11 European data portals.
• To demonstrate the performance and limitations of our spatio-temporal labelling we have evaluated the annotations by manual inspection of a random sample per data portal, where we identified correct geo-annotations for 87 of the 101 inspected datasets (86%).
• To access and query the data, we offer a user interface, RESTful APIs, and a SPARQL endpoint, which allow structured queries over our spatio-temporal annotations.
To the best of our knowledge, this is the first work addressing spatio-temporal labelling and allowing structured spatio-temporal search of datasets based on a knowledge graph of temporal and geo-entities.
To further improve geo-labelling, we are currently working on parsing coordinates in datasets. To do so we have to consider that the long/lat pairs potentially come in column groups, i.e., one column per coordinate, which we might need to combine. Having all the geometries for the geo-entities and data points, we want to integrate a visual representation and search interface for datasets by displaying and filtering the datasets/records on a map.
While CSV is a popular and dominant data-publishing format on the Web [13], we also want to extend our indexing to other popular Open Data formats (such as XLS and JSON). Additionally, we want to test how well our approaches could be applied to unstructured or semi-structured data and other domains such as tweets or web pages (e.g., newspaper articles), or complementarily, we could use our knowledge graph, along with known methods for temporal and geo-labelling of such unstructured sources, to link them to (supporting) data, to enable for instance fact checking. The applications of Open Data sources searchable and annotated in such a manner seem promising and widespread.

suffice to extract the data relevant for us in Section 3. A common problem with these sources is, however, that either such a SPARQL endpoint is not available or it does not support complex queries. To this end, we discuss in this appendix how such limitations could be circumvented in the specific cases. We note that we expect the presented workaround could be similarly applied to other use cases for extracting relevant data from large RDF dumps or public endpoints, so we hope the discussion herein might also be useful for others.

"T23:59:59"))) AS ?EndDateTime ) }
Figure A.19: SPARQL query on local Wikidata extract (namespaces as in Figure 2).