Query Parsing Task Proposal for GeoCLEF 2007
Microsoft Research Asia
A geographic query is usually composed of three components: “what”, “geo-relation” and “where”. Parsing queries and extracting these components is a key problem for geographic information retrieval (GIR). We therefore propose adding a geographic query parsing task to GeoCLEF 2007.
[To be revised]
The contest is open to any party planning to attend CLEF 2007. Multiple submissions per group are allowed, since we will not provide feedback at the time of submission. Only the LAST submission before the deadline will be evaluated.
In the data set, a common query structure will be “what” + “geo-relation” + “where”. The keywords in the “what” component indicate what users want to search for; “where” indicates the geographic area users are interested in; “geo-relation” stands for the relationship between “what” and “where”. The data set also contains non-geographic queries, which need to be recognized as such.
For example, for the query “Restaurant in Beijing, China”, “what” = “Restaurant”, “where” = “Beijing, China”, and “geo-relation” = “IN”. For another query, “Mountains in the south of United States”, “what” = “Mountains”, “where” = “United States”, and “geo-relation” = “SOUTH_OF”.
1. Detect whether the query is a geographic query or not. A geographic query is defined as a query which contains at least one “where” component. For example, “pizza in Seattle, WA” is a geographic query, while “Microsoft software” is not. For non-geographic queries, no further parsing is needed.
2. Extract the “where” component from the geographic query and output the corresponding latitude/longitude. For example, in the query “pizza in Seattle, WA”, “Seattle, WA” will be extracted and the lat/long value (47.59, -122.33) will be output. Sometimes terms in the “where” component are ambiguous; in this case, the participant should output the lat/long value with the highest confidence. A few queries contain multiple locations, for example, “bus lines from US to Canada”. We will not include these queries in our test set.
3. Extract the “geo-relation” component from the geographic query and normalize it into a pre-defined relation type. A suggested relation type list is shown in Table 1. If the relation type you find is not defined in Table 1, you should categorize it into “UNDEFINED”.
Table 1. Geo-relation Types
Example query | Geo-relation
Beijing | NONE
in Beijing | IN
on Long Island | ON
of Beijing | OF
near Beijing; next to Beijing | NEAR
in or around Beijing; in and around Beijing | IN_NEAR
along the Rhine | ALONG
at Beijing University | AT
from Beijing | FROM
to Beijing | TO
within d miles of Beijing | DISTANCE
north of Beijing; in the north of Beijing | NORTH_OF
south of Beijing; in the south of Beijing | SOUTH_OF
east of Beijing; in the east of Beijing | EAST_OF
west of Beijing; in the west of Beijing | WEST_OF
northeast of Beijing; in the northeast of Beijing | NORTH_EAST_OF
northwest of Beijing; in the northwest of Beijing | NORTH_WEST_OF
southeast of Beijing; in the southeast of Beijing | SOUTH_EAST_OF
southwest of Beijing; in the southwest of Beijing | SOUTH_WEST_OF
north to Beijing | NORTH_TO
south to Beijing | SOUTH_TO
east to Beijing | EAST_TO
west to Beijing | WEST_TO
northeast to Beijing | NORTH_EAST_TO
northwest to Beijing | NORTH_WEST_TO
southeast to Beijing | SOUTH_EAST_TO
southwest to Beijing | SOUTH_WEST_TO
4. Extract the “what” component from the geographic query and categorize it into one of three predefined types, which are listed below.
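The four steps above can be sketched as a simple rule-based parser. The pattern list below covers only a subset of Table 1 for brevity, and the heuristic of treating any relation match as evidence of a geographic query is an illustrative assumption: a real system would validate the “where” candidate against a gazetteer and would also resolve it to lat/long coordinates, which this sketch omits.

```python
import re

# Ordered (pattern, normalized type) pairs drawn from Table 1
# (subset for brevity). Longer, more specific patterns come first
# so e.g. "in the north of" matches NORTH_OF before the bare "in"
# matches IN.
RELATION_PATTERNS = [
    (r"\bwithin \d+ miles of\b", "DISTANCE"),
    (r"\b(?:in the )?north of\b", "NORTH_OF"),
    (r"\b(?:in the )?south of\b", "SOUTH_OF"),
    (r"\b(?:in the )?east of\b", "EAST_OF"),
    (r"\b(?:in the )?west of\b", "WEST_OF"),
    (r"\bin (?:or|and) around\b", "IN_NEAR"),
    (r"\b(?:near|next to)\b", "NEAR"),
    (r"\balong\b", "ALONG"),
    (r"\bfrom\b", "FROM"),
    (r"\bat\b", "AT"),
    (r"\bto\b", "TO"),
    (r"\bon\b", "ON"),
    (r"\bof\b", "OF"),
    (r"\bin\b", "IN"),
]

def parse_query(query):
    """Split a query into (what, geo_relation, where).

    Returns None for queries with no recognized relation. A real
    system would instead check candidate terms against a gazetteer,
    since bare place names like "Beijing" (relation NONE) are also
    geographic queries.
    """
    for pattern, rel_type in RELATION_PATTERNS:
        m = re.search(pattern, query, re.IGNORECASE)
        if m:
            what = query[:m.start()].strip()
            where = query[m.end():].strip()
            if where:
                return what, rel_type, where
    return None

print(parse_query("Restaurant in Beijing, China"))
# ('Restaurant', 'IN', 'Beijing, China')
print(parse_query("Mountains in the south of United States"))
# ('Mountains', 'SOUTH_OF', 'United States')
```

Pattern order matters: the compass-direction patterns must be tried before the generic “of” and “in” patterns, otherwise “south of Beijing” would be mis-normalized as OF.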
We will provide a test set of 800,000 (to be decided) queries. All queries come from real search engine logs. Most of them will be geographical queries. A sample labeled set of 100 (to be decided) queries will also be provided as a training set.
The test set will be provided in XML format. Each query record consists of two elements: <QUERYNO> and <QUERY>.
<QUERYNO>1</QUERYNO>
<QUERY>Restaurant in Beijing, China</QUERY>
<QUERYNO>2</QUERYNO>
<QUERY>Real estate in Florida</QUERY>
<QUERYNO>3</QUERYNO>
<QUERY>Mountains in the south of United States</QUERY>
The sample labeled set and the results should be in the following format, which adds six elements: <LOCAL>, <WHAT>, <WHAT-TYPE>, <GEO-RELATION>, <WHERE>, and <LAT-LONG>.
<QUERYNO>1</QUERYNO>
<QUERY>Restaurant in Beijing, China</QUERY>
<LOCAL>YES</LOCAL>
<WHAT>Restaurant</WHAT>
<WHAT-TYPE>Yellow page</WHAT-TYPE>
<GEO-RELATION>IN</GEO-RELATION>
<WHERE>Beijing, China</WHERE>
<LAT-LONG>40.24, 116.42</LAT-LONG>
<QUERYNO>2</QUERYNO>
<QUERY>Lottery in Florida</QUERY>
<LOCAL>YES</LOCAL>
<WHAT>Lottery</WHAT>
<WHAT-TYPE>Information</WHAT-TYPE>
<GEO-RELATION>IN</GEO-RELATION>
<WHERE>Florida</WHERE>
<LAT-LONG>28.38, -81.75</LAT-LONG>
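Since the records above are flat element sequences rather than one rooted XML document, a submission can be emitted line by line. The helper below is a minimal sketch of a writer for this format; the function name and argument defaults are our own assumptions, and the sample values are taken directly from the first labeled record above.

```python
def format_record(queryno, query, local, what=None, what_type=None,
                  geo_relation=None, where=None, lat_long=None):
    """Render one labeled record in the sample submission format.

    Non-geographic queries (<LOCAL>NO</LOCAL>) omit the parse fields,
    mirroring step 1 of the task, which requires no further parsing
    for them.
    """
    lines = [
        f"<QUERYNO>{queryno}</QUERYNO>",
        f"<QUERY>{query}</QUERY>",
        f"<LOCAL>{local}</LOCAL>",
    ]
    if local == "YES":
        lines += [
            f"<WHAT>{what}</WHAT>",
            f"<WHAT-TYPE>{what_type}</WHAT-TYPE>",
            f"<GEO-RELATION>{geo_relation}</GEO-RELATION>",
            f"<WHERE>{where}</WHERE>",
            f"<LAT-LONG>{lat_long}</LAT-LONG>",
        ]
    return "\n".join(lines)

record = format_record(1, "Restaurant in Beijing, China", "YES",
                       "Restaurant", "Yellow page", "IN",
                       "Beijing, China", "40.24, 116.42")
print(record)
```

Remember that every query in the test set must appear in the output, since missing queries are counted as errors.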
If a submission does not contain all search queries, those queries not included will be treated as errors.
We will evaluate the submitted results based on several criteria, including precision, recall, and F1-score.
We will use multiple human editors to tag a subset of queries selected from the full test set. The collection of human editors is assumed to have more complete knowledge about the Internet than any individual end user. You will not know which queries will be used for evaluation and are asked to label all queries given.
The evaluation will run on the selected test queries and rank your results by how closely they match the results from the human editors. The following measures will be used to evaluate submitted results:
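As one concrete instance, precision, recall, and F1 for the binary geographic/non-geographic detection step (step 1) can be computed as below. The function name and label encoding are illustrative assumptions, not the official scorer; note that a query missing from a submission counts as an error, as the proposal specifies, so it is scored as a false negative here.

```python
def precision_recall_f1(gold, predicted):
    """Compute P/R/F1 for binary geographic-query detection.

    gold, predicted: dicts mapping query number -> "YES"/"NO"
    (the <LOCAL> label). Queries absent from `predicted` are
    treated as wrong, per the missing-query rule.
    """
    tp = sum(1 for q, label in gold.items()
             if label == "YES" and predicted.get(q) == "YES")
    fp = sum(1 for q, label in predicted.items()
             if label == "YES" and gold.get(q) != "YES")
    fn = sum(1 for q, label in gold.items()
             if label == "YES" and predicted.get(q) != "YES")
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {1: "YES", 2: "YES", 3: "NO", 4: "YES"}
pred = {1: "YES", 2: "NO", 3: "YES"}  # query 4 missing -> false negative
print(precision_recall_f1(gold, pred))
# (0.5, 0.3333333333333333, 0.4)
```

The same scheme extends to the other components by counting a prediction as correct only when the extracted span (and, for “where”, the lat/long) matches the editors’ label.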
