善用Wikidata语义网
人们用他们熟悉的自然语言如中文在网上编辑了海量的信息,但自然语言固有的极端复杂性使这些文本难以为机器所理解,于是人们因为找不到已有的信息而重复劳动,这个恶性循环的结果是互联网上文章泛滥但质量普遍低下,特别是过时信息得不到清理。相反,语义网提倡用更规范的方式表达知识,并把它们连接起来。维基基金会的Wikidata就是一个人人可编辑和利用的知识库,有大量志愿者持续扩展和修正它,这使它的数据无论在数量上还是质量上都处于领先地位,维基百科中条目右上方卡片的信息就来自Wikidata。值得一提的是Wikidata的数据位于公众领域,可以(接近)无条件地合法使用。
探索语义网
世界上有许多实体,其中包括具体的对象如“铅笔”和抽象的概念如“实数”。在自然语言中实体用名词表示,在语义网中我们用IRI表示实体。在自然语言中实体与名词的关系是多对多的(比如一个人可以有多个绰号,不同人也可以同名同姓),而在语义网中实体与IRI的关系是一对一的,这使得语义网中可以避开通常句子中的歧义问题。同时,使用IRI表示实体让我们可方便地获取关于实体的有用信息:把它当作URL去访问即可。
断言指出实体的单个特点,大致对应于自然语言中最简单的陈述句。自然语言中的陈述句往往相当复杂,同时表达了多种含义,但语义网中的每个断言应该只表达一项单一的知识。语义网中最常见的断言由一个主体、一个属性谓词和一个客体组成,多数情况下它们都是实体,但也可能是数值、字符串等等。
语义网的基本模型就这么简单,让我们来体验一下Wikidata如何体现语义网的原则。在浏览器打开一个实体IRI ,比如说http://www.wikidata.org/entity/Q753535,其中以Q开始的是Wikidata中实体的编号:
我们可以看到以下信息:
- 这个实体在不同语言中的惟一最常见名称,如在中文是“金瓶梅”。
- 这个实体在不同语言中的各个别名,如在英文有“The Plum in the Golden Vase”、“.The Golden Lotus”、“Plum in the Golden Vase”、“Golden Lotus”和“Jinpingmei”。
- 这个实体在不同语言中的补充描述(用于区分自然语言中同名的实体),如在中文是“中国古典小说”。
- 其它以这实体为主体的断言(不同实体有不同的属性),比如说由此我们可知:
- 这个实体是一本书
- 这个实体的某一张图
- 这个实体始于1610年
- 这个实体源于中国
- 这个实体在维基百科的页面
我们看到,谓词和大部分客体都是实体,在上述页面中我们可以通过链接继续浏览它们,例如我们可进一步了解作者“兰陵笑笑生”,相关实体之间的断言使Wikidata成为真正的网(用图论的话,实体对应顶点而断言对应边),我们可沿这个网逐渐展开知识。部分断言更带有出处,方便进行考证。
当然在浏览语义网前要知道实体的IRI,幸运的是Wikidata网站右上角的搜索框可以按实体名称或别名搜索实体。
HTML格式虽然方便人们用浏览器查看,但用程序处理时类似JSON的格式更方便且节省流量,这时可以改用https://www.wikidata.org/wiki/Special:EntityData/Q753535.json,类似地按要求的格式可改用rdf
、ttl
或nl
后缀,以下演示返回结果的结构(但内容经过大幅删减):
{
"entities": {
"Q753535": {
"pageid": 708936,
"ns": 0,
"title": "Q753535",
"lastrevid": 805212899,
"modified": "2018-12-05T03:30:00Z",
"type": "item",
"id": "Q753535",
"labels": {
"zh-hans": {
"language": "zh-hans",
"value": "金瓶梅"
},
"zh-hant": {
"language": "zh-hant",
"value": "金瓶梅"
},
"zh-hk": {
"language": "zh-hk",
"value": "金瓶梅"
},
"en": {
"language": "en",
"value": "Jin Ping Mei"
},
"zh": {
"language": "zh",
"value": "金瓶梅"
}
},
"descriptions": {
"en": {
"language": "en",
"value": "Chinese naturalistic novel"
},
"zh": {
"language": "zh",
"value": "中国古典小说"
}
},
"aliases": {
"zh": [{
"language": "zh",
"value": "金瓶梅词话"
}],
"en": [{
"language": "en",
"value": "The Plum in the Golden Vase"
}, {
"language": "en",
"value": ".The Golden Lotus"
}, {
"language": "en",
"value": "Plum in the Golden Vase"
}, {
"language": "en",
"value": "Golden Lotus"
}, {
"language": "en",
"value": "Jinpingmei"
}]
},
"claims": {
"P50": [{
"mainsnak": {
"snaktype": "value",
"property": "P50",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 832785,
"id": "Q832785"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
},
"type": "statement",
"id": "Q753535$fe287f80-43c8-22f7-a0a0-32315931fa20",
"rank": "normal"
}],
"P373": [{
"mainsnak": {
"snaktype": "value",
"property": "P373",
"datavalue": {
"value": "Jinpingmei",
"type": "string"
},
"datatype": "string"
},
"type": "statement",
"id": "q753535$7B87F320-CD1A-4A8F-A36D-0F562AFD6550",
"rank": "normal",
"references": [{
"hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
"snaks": {
"P143": [{
"snaktype": "value",
"property": "P143",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 328,
"id": "Q328"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
}]
},
"snaks-order": ["P143"]
}]
}],
"P31": [{
"mainsnak": {
"snaktype": "value",
"property": "P31",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 571,
"id": "Q571"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
},
"type": "statement",
"id": "q753535$D7105EE3-F496-4639-B951-43C0F52D53BA",
"rank": "normal",
"references": [{
"hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
"snaks": {
"P143": [{
"snaktype": "value",
"property": "P143",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 328,
"id": "Q328"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
}]
},
"snaks-order": ["P143"]
}]
}],
"P957": [{
"mainsnak": {
"snaktype": "value",
"property": "P957",
"datavalue": {
"value": "2-07-031490-1",
"type": "string"
},
"datatype": "external-id"
},
"type": "statement",
"id": "Q753535$1A821766-DDAF-4185-BB09-FCCEEBA8C0EC",
"rank": "normal"
}],
"P136": [{
"mainsnak": {
"snaktype": "value",
"property": "P136",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 8261,
"id": "Q8261"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
},
"type": "statement",
"id": "Q753535$63866EAC-F73C-4B89-9211-7771FE802D7B",
"rank": "normal"
}, {
"mainsnak": {
"snaktype": "value",
"property": "P136",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 11452132,
"id": "Q11452132"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
},
"type": "statement",
"id": "Q753535$E7AFFA14-BD4A-4F6A-BDE6-1BE66555758E",
"rank": "normal",
"references": [{
"hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
"snaks": {
"P143": [{
"snaktype": "value",
"property": "P143",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 328,
"id": "Q328"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
}]
},
"snaks-order": ["P143"]
}]
}],
"P577": [{
"mainsnak": {
"snaktype": "value",
"property": "P577",
"datavalue": {
"value": {
"time": "+1610-00-00T00:00:00Z",
"timezone": 0,
"before": 0,
"after": 0,
"precision": 9,
"calendarmodel": "http://www.wikidata.org/entity/Q1985727"
},
"type": "time"
},
"datatype": "time"
},
"type": "statement",
"id": "Q753535$b7a4cc71-4f11-0068-df03-a566332ebf45",
"rank": "normal"
}],
"P18": [{
"mainsnak": {
"snaktype": "value",
"property": "P18",
"datavalue": {
"value": "Jin Ping Mei.jpg",
"type": "string"
},
"datatype": "commonsMedia"
},
"type": "statement",
"id": "Q753535$B4550D8C-62D4-4868-8020-23973E43313A",
"rank": "normal",
"references": [{
"hash": "dbc378ca38aa06d2ac2a1b790ffe3a5ee9d0bd17",
"snaks": {
"P143": [{
"snaktype": "value",
"property": "P143",
"datavalue": {
"value": {
"entity-type": "item",
"numeric-id": 53464,
"id": "Q53464"
},
"type": "wikibase-entityid"
},
"datatype": "wikibase-item"
}]
},
"snaks-order": ["P143"]
}]
}]
},
"sitelinks": {
"enwiki": {
"site": "enwiki",
"title": "Jin Ping Mei",
"badges": [],
"url": "https://en.wikipedia.org/wiki/Jin_Ping_Mei"
},
"enwikiquote": {
"site": "enwikiquote",
"title": "Jin Ping Mei",
"badges": [],
"url": "https://en.wikiquote.org/wiki/Jin_Ping_Mei"
},
"zhwiki": {
"site": "zhwiki",
"title": "金瓶梅",
"badges": [],
"url": "https://zh.wikipedia.org/wiki/%E9%87%91%E7%93%B6%E6%A2%85"
},
"zhwikisource": {
"site": "zhwikisource",
"title": "金瓶梅",
"badges": [],
"url": "https://zh.wikisource.org/wiki/%E9%87%91%E7%93%B6%E6%A2%85"
}
}
}
}
}
如果需要更仔细的控制,如只要求部分语言和属性,又或一次性取得多个实体的信息,可以参考Wikidata API的文档,比如说https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1&props=labels|aliases|descriptions&languages=en|zh&format=json
可取得实体Q1
在中文和英文的名称、别名和描述:
{
"entities": {
"Q1": {
"type": "item",
"id": "Q1",
"labels": {
"zh": {
"language": "zh",
"value": "\u5b87\u5b99"
},
"en": {
"language": "en",
"value": "Universe"
}
},
"descriptions": {
"zh": {
"language": "zh",
"value": "\u4e00\u5207\u7a7a\u95f4\u3001\u65f6\u95f4\u3001\u7269\u8d28\u548c\u80fd\u91cf\u6784\u6210\u7684\u603b\u4f53"
},
"en": {
"language": "en",
"value": "totality of space and all contents"
}
},
"aliases": {
"en": [
{
"language": "en",
"value": "Our Universe"
},
{
"language": "en",
"value": "The Universe"
},
{
"language": "en",
"value": "Universe (Ours)"
},
{
"language": "en",
"value": "The Cosmos"
},
{
"language": "en",
"value": "cosmos"
}
]
}
}
},
"success": 1
}
同样地,搜索也可用API,如https://www.wikidata.org/w/api.php?action=wbsearchentities&search=apache&language=en&limit=2&format=json
可搜索英文词”apache“并返回最多两个结果:
{
"searchinfo": {
"search": "apache"
},
"search": [
{
"repository": "",
"id": "Q102090",
"conceptIRI": "http://www.wikidata.org/entity/Q102090",
"title": "Q102090",
"pageid": 104694,
"url": "//www.wikidata.org/wiki/Q102090",
"label": "Apache",
"description": "several culturally related groups of Native Americans in the United States",
"match": {
"type": "label",
"language": "en",
"text": "Apache"
}
},
{
"repository": "",
"id": "Q11354",
"conceptIRI": "http://www.wikidata.org/entity/Q11354",
"title": "Q11354",
"pageid": 12854,
"url": "//www.wikidata.org/wiki/Q11354",
"label": "Apache HTTP Server",
"description": "open-source web server software",
"match": {
"type": "label",
"language": "en",
"text": "Apache HTTP Server"
}
}
],
"search-continue": 2,
"success": 1
}
发掘语义网
上面介绍了获取关于一个实体的信息的方法,但现实中我们通常想要的是关于满足一定条件的所有实体的信息。虽然把Wikidata中所有数据从https://dumps.wikimedia.org/wikidatawiki/entities/下载下来是可能的,但下载至少几十GB的压缩包下来本地处理对于许多用途而言太笨重了,而且如果不定期与上游同步(如下载最近30天的变更)数据就会开始变得过时。为此,Wikidata提供了一个查询服务https://query.wikidata.org/,大家可以在浏览器使用其IDE编辑查询,也可以写程序向URLhttps://query.wikidata.org/sparql?query=经编码的查询
发送GET
或POST
请求,返回结果的格式可用HTTP
头Accept:
的值application/sparql-results+xml
、application/sparql-results+json
、text/tab-separated-values
、text/csv
或application/x-binary-rdf-results-table
决定。
这个查询服务接受的查询语言是SPARQL,与受Prolog启发的SQL高度类似。为了尽早熟悉这语言,不妨先尝试https://query.wikidata.org/页面提供的示例,然后可以尝试调整过滤器看看结果有什么变化。接着可以从空白查询出发通过新增过滤器建立查询,例如先输入”黄霑“选取Q709317作为客体,再选择属性作词者(P676)就会得到查询(含义是“哪些歌的作词人是黄霑?”):
SELECT ?__ ?__Label WHERE {
SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
?__ wdt:P676 wd:Q709317.
}
LIMIT 100
提交查询就会得到结果如:
__ |
__Label |
---|---|
wd:Q1743796 |
我的中國心 |
wd:Q10867969 |
上海灘 |
wd:Q15909571 |
獅子山下 |
SPARQL
以下我们说明怎么写SPARQL查询。
简单查询
如前所述,一个断言本质上是由主体、属性谓词和客体组成的三元组,它们三者通常都是用IRI表示的实体,因而一个典型的断言形如http://www.wikidata.org/entity/Q444381 http://www.wikidata.org/prop/direct/P50 http://www.wikidata.org/entity/Q23114
(这里http://www.wikidata.org/entity/Q444381
表示《孔乙己》、http://www.wikidata.org/prop/direct/P50
表示属性“作者是”、http://www.wikidata.org/entity/Q23114
表示鲁迅,因而这断言的意思是“《孔乙己》的作者是鲁迅”)。由于每次都使用完整的IRI太冗长,我们通常把常用的前缀如http://www.wikidata.org/entity/
和http://www.wikidata.org/prop/direct/
分别简记为wd:
和wdt:
,于是上述断言可以简化为wd:Q444381 wdt:P50 wd:Q23114
。当然,我们也可以自定义其它名称空间,如在查询前加上PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
后查询中就可用rdfs:
代替IRI前缀http://www.w3.org/2000/01/rdf-schema#
。
现在我们想在已知断言的一部分的前提下找出断言的其余部分,于是我们把断言中未知的部分用?变量名
代替(变量名是我们选取的,不同变量名代表不同的未知量,如果不关心变量值也可用_:变量名
),然后在前面加上SELECT * WHERE{
而在后面加上}
就得到查询,比如说:
查询 | 效果 |
---|---|
SELECT * WHERE {?work wdt:P50 wd:Q23114} |
查找鲁迅有哪些作品 |
SELECT * WHERE {wd:Q444381 ?relationship wd:Q23114} |
查找《孔乙己》与鲁迅有什么关系 |
SELECT * WHERE {wd:Q444381 wdt:P50 ?author} |
查找《孔乙己》的作者是谁 |
SELECT * WHERE {?work wdt:P50 ?author} |
查找所有作品和它们分别的作者是谁 |
尝试这些查询会分别得到一个表格,表格的各个列分别表示各未知变量的值而各行表示不同断言。不过,由于表格中的值都是实体的IRI,我们并不知道它们分别代表什么,我们更想要的是实体的名称。由于实体到它的名称通过属性谓词rdfs:label
给出(顺便一提,别名由skos:altLabel
给出、描述由schema:description
给出),我们把查询修改为SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName}
(这里两个元组用句点.
分隔)。读者可能已经看出这相当于Prolog中的合一或SQL中的表连接,现在我们得到了鲁迅各作品的IRI和对应的名称。
不过,我们注意到同一作品在不同语言中名称都在不同行出来了,而我们只对中文名称感兴趣。由于语义网中的字符串是带语言标签的(如英文字符串表示为类似"Hello world"@en
、繁体中文字符串表示为类似"軟體"@zh-hant
),于是我们把查询改为SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh')}
。
假如我们还想要作品的英文名称,我们自然会写出SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh'). ?work rdfs:label ?workNameEn. FILTER(LANG(?workNameEn)='en')}
,但你会看到结果变少了,原因是许多作品没有英文名。如果我们只要求在有英文名时返回英文名,则可把查询改为:
SELECT * WHERE {
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName.
FILTER(LANG(?workName)='zh').
OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
}
假如我们对实体IRI完全不感兴趣,可以把查询改为SELECT ?workName WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh')}
,这样结果中只会有?workName
一列,就像SQL中的投影。顺便一提,连接再投影后可能有重复结果,例如用SELECT ?coAuthorName WHERE {?work wdt:P50 wd:Q173746.?work wdt:P50 ?coAuthor.FILTER(?coAuthor != wd:Q173746)?coAuthor rdfs:label ?coAuthorName.FILTER(LANG(?coAuthorName)='en')}
查询埃尔德什的合著者英文姓名的名单时,结果中有重复是由于有人与他合作多次的结果。要消除重复只用在SELECT
后加上DISTINCT
即可(如果我们只容许但不要求消除重复,则改用REDUCED
)。
反过来,我们可能想要在结果可增加列,新列的值由原有列的值计算得到。例如我们可能想把英文标题变成大写:
SELECT ?workName ?workNameEnCap WHERE {
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName.
FILTER(LANG(?workName)='zh').
OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}
一般来说结果集是无序的,但我们可以要求按变量值或更一般表达式的值排序,例如以下按英文名倒序排(排序则把DESC
改为ASC
):
SELECT ?workName ?workNameEnCap WHERE {
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName.
FILTER(LANG(?workName)='zh').
OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}ORDER BY DESC(?workNameEnCap)
有时候结果多了看不过来,这就需要对结果集分页。如只要求返回第10个结果开始的5个结果:
SELECT ?workName ?workNameEnCap WHERE {
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName.
FILTER(LANG(?workName)='zh').
OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}ORDER BY DESC(?workNameEnCap)
OFFSET 10 LIMIT 5
SPARQL还提供了一些语法糖让查询写起来更紧凑一点,以下举例说明:
片段 | 简化 | 附注 |
---|---|---|
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName |
?work wdt:P50 wd:Q23114; rdfs:label ?workName |
相同主体的三元组 |
?work rdfs:label ?workName. ?work rdfs:label ?workNameEn |
?work rdfs:label ?workName,?workNameEn |
相同主体和谓词的元组 |
wd:Q444381 wdt:P50 _:author._:author wdt:P1412 ?language |
wd:Q444381 wdt:P50/wdt:P1412 ?language |
间接关系 |
聚合查询
与SQL类似,SPARQL也有聚合子句。比如我们想查询鲁迅不同年份的作品数,可以使用以下查询:
SELECT ?year (COUNT(?work) AS ?count) WHERE {
?work wdt:P50 wd:Q23114.
?work wdt:P577 ?date.
}GROUP BY (YEAR(?date) AS ?year)
ORDER BY ?year
我们还可以过滤分组,比如只要作品数多于一的年份:
SELECT ?year (COUNT(?work) AS ?count) WHERE {
?work wdt:P50 wd:Q23114.
?work wdt:P577 ?date.
}GROUP BY (YEAR(?date) AS ?year)
HAVING (COUNT(?work)>1)
ORDER BY ?year
除了COUNT
外,还有MIN
、MAX
、AVG
、SAMPLE
和GROUP_CONCAT
可作聚合函数,需要去重复再聚合的话可以在(
后加上DISTINCT
。
完整的查询语法
有了前面的例子,我们已经了解SPARQL中最常用的SELECT
语句的结构,另外还有用于判断结果集是否非空但不返回结果集的ASK
语句、用于用结果集构造图的CONSTRUCT
语句和返回实体全部属性的DESCRIBE
语句。以下给出形式化的定义(关键词除用于表示类型谓词的a
外都不区分大小写):
[1] QueryUnit ::= Query
[2] Query ::= Prologue
( SelectQuery | ConstructQuery | DescribeQuery | AskQuery )
ValuesClause
[4] Prologue ::= ( BaseDecl | PrefixDecl )*
[5] BaseDecl ::= 'BASE' IRIREF
[6] PrefixDecl ::= 'PREFIX' PNAME_NS IRIREF
[7] SelectQuery ::= SelectClause DatasetClause* WhereClause SolutionModifier
[8] SubSelect ::= SelectClause WhereClause SolutionModifier ValuesClause
[9] SelectClause ::= 'SELECT' ( 'DISTINCT' | 'REDUCED' )? ( ( Var | ( '(' Expression 'AS' Var ')' ) )+ | '*' )
[10] ConstructQuery ::= 'CONSTRUCT' ( ConstructTemplate DatasetClause* WhereClause SolutionModifier | DatasetClause* 'WHERE' '{' TriplesTemplate? '}' SolutionModifier )
[11] DescribeQuery ::= 'DESCRIBE' ( VarOrIri+ | '*' ) DatasetClause* WhereClause? SolutionModifier
[12] AskQuery ::= 'ASK' DatasetClause* WhereClause SolutionModifier
[13] DatasetClause ::= 'FROM' ( DefaultGraphClause | NamedGraphClause )
[14] DefaultGraphClause ::= SourceSelector
[15] NamedGraphClause ::= 'NAMED' SourceSelector
[16] SourceSelector ::= iri
[17] WhereClause ::= 'WHERE'? GroupGraphPattern
[18] SolutionModifier ::= GroupClause? HavingClause? OrderClause? LimitOffsetClauses?
[19] GroupClause ::= 'GROUP' 'BY' GroupCondition+
[20] GroupCondition ::= BuiltInCall | FunctionCall | '(' Expression ( 'AS' Var )? ')' | Var
[21] HavingClause ::= 'HAVING' HavingCondition+
[22] HavingCondition ::= Constraint
[23] OrderClause ::= 'ORDER' 'BY' OrderCondition+
[24] OrderCondition ::= ( ( 'ASC' | 'DESC' ) BrackettedExpression )
| ( Constraint | Var )
[25] LimitOffsetClauses ::= LimitClause OffsetClause? | OffsetClause LimitClause?
[26] LimitClause ::= 'LIMIT' INTEGER
[27] OffsetClause ::= 'OFFSET' INTEGER
[28] ValuesClause ::= ( 'VALUES' DataBlock )?
[52] TriplesTemplate ::= TriplesSameSubject ( '.' TriplesTemplate? )?
其中用于过滤断言的WHERE子句的组成如下:
[53] GroupGraphPattern ::= '{' ( SubSelect | GroupGraphPatternSub ) '}'
[54] GroupGraphPatternSub ::= TriplesBlock? ( GraphPatternNotTriples '.'? TriplesBlock? )*
[55] TriplesBlock ::= TriplesSameSubjectPath ( '.' TriplesBlock? )?
[56] GraphPatternNotTriples ::= GroupOrUnionGraphPattern | OptionalGraphPattern | MinusGraphPattern | GraphGraphPattern | ServiceGraphPattern | Filter | Bind | InlineData
[57] OptionalGraphPattern ::= 'OPTIONAL' GroupGraphPattern
[58] GraphGraphPattern ::= 'GRAPH' VarOrIri GroupGraphPattern
[59] ServiceGraphPattern ::= 'SERVICE' 'SILENT'? VarOrIri GroupGraphPattern
[60] Bind ::= 'BIND' '(' Expression 'AS' Var ')'
[61] InlineData ::= 'VALUES' DataBlock
[62] DataBlock ::= InlineDataOneVar | InlineDataFull
[63] InlineDataOneVar ::= Var '{' DataBlockValue* '}'
[64] InlineDataFull ::= ( NIL | '(' Var* ')' ) '{' ( '(' DataBlockValue* ')' | NIL )* '}'
[65] DataBlockValue ::= iri | RDFLiteral | NumericLiteral | BooleanLiteral | 'UNDEF'
[66] MinusGraphPattern ::= 'MINUS' GroupGraphPattern
[67] GroupOrUnionGraphPattern ::= GroupGraphPattern ( 'UNION' GroupGraphPattern )*
[68] Filter ::= 'FILTER' Constraint
[69] Constraint ::= BrackettedExpression | BuiltInCall | FunctionCall
[70] FunctionCall ::= iri ArgList
[71] ArgList ::= NIL | '(' 'DISTINCT'? Expression ( ',' Expression )* ')'
[72] ExpressionList ::= NIL | '(' Expression ( ',' Expression )* ')'
[73] ConstructTemplate ::= '{' ConstructTriples? '}'
[74] ConstructTriples ::= TriplesSameSubject ( '.' ConstructTriples? )?
[75] TriplesSameSubject ::= VarOrTerm PropertyListNotEmpty | TriplesNode PropertyList
[76] PropertyList ::= PropertyListNotEmpty?
[77] PropertyListNotEmpty ::= Verb ObjectList ( ';' ( Verb ObjectList )? )*
[78] Verb ::= VarOrIri | 'a'
[79] ObjectList ::= Object ( ',' Object )*
[80] Object ::= GraphNode
[81] TriplesSameSubjectPath ::= VarOrTerm PropertyListPathNotEmpty | TriplesNodePath PropertyListPath
[82] PropertyListPath ::= PropertyListPathNotEmpty?
[83] PropertyListPathNotEmpty ::= ( VerbPath | VerbSimple ) ObjectListPath ( ';' ( ( VerbPath | VerbSimple ) ObjectList )? )*
[84] VerbPath ::= Path
[85] VerbSimple ::= Var
[86] ObjectListPath ::= ObjectPath ( ',' ObjectPath )*
[87] ObjectPath ::= GraphNodePath
[88] Path ::= PathAlternative
[89] PathAlternative ::= PathSequence ( '|' PathSequence )*
[90] PathSequence ::= PathEltOrInverse ( '/' PathEltOrInverse )*
[91] PathElt ::= PathPrimary PathMod?
[92] PathEltOrInverse ::= PathElt | '^' PathElt
[93] PathMod ::= '?' | '*' | '+'
[94] PathPrimary ::= iri | 'a' | '!' PathNegatedPropertySet | '(' Path ')'
[95] PathNegatedPropertySet ::= PathOneInPropertySet | '(' ( PathOneInPropertySet ( '|' PathOneInPropertySet )* )? ')'
[96] PathOneInPropertySet ::= iri | 'a' | '^' ( iri | 'a' )
[97] Integer ::= INTEGER
[98] TriplesNode ::= Collection | BlankNodePropertyList
[99] BlankNodePropertyList ::= '[' PropertyListNotEmpty ']'
[100] TriplesNodePath ::= CollectionPath | BlankNodePropertyListPath
[101] BlankNodePropertyListPath ::= '[' PropertyListPathNotEmpty ']'
[102] Collection ::= '(' GraphNode+ ')'
[103] CollectionPath ::= '(' GraphNodePath+ ')'
[104] GraphNode ::= VarOrTerm | TriplesNode
[105] GraphNodePath ::= VarOrTerm | TriplesNodePath
[106] VarOrTerm ::= Var | GraphTerm
[107] VarOrIri ::= Var | iri
[108] Var ::= VAR1 | VAR2
[109] GraphTerm ::= iri | RDFLiteral | NumericLiteral | BooleanLiteral | BlankNode | NIL
接下来是表达式语法,虽然看起来可能有点复杂,但由于语法和语义上都与其它常见语言差不多,没有必要作太多讲解。
[110] Expression ::= ConditionalOrExpression
[111] ConditionalOrExpression ::= ConditionalAndExpression ( '||' ConditionalAndExpression )*
[112] ConditionalAndExpression ::= ValueLogical ( '&&' ValueLogical )*
[113] ValueLogical ::= RelationalExpression
[114] RelationalExpression ::= NumericExpression ( '=' NumericExpression | '!=' NumericExpression | '<' NumericExpression | '>' NumericExpression | '<=' NumericExpression | '>=' NumericExpression | 'IN' ExpressionList | 'NOT' 'IN' ExpressionList )?
[115] NumericExpression ::= AdditiveExpression
[116] AdditiveExpression ::= MultiplicativeExpression ( '+' MultiplicativeExpression | '-' MultiplicativeExpression | ( NumericLiteralPositive | NumericLiteralNegative ) ( ( '*' UnaryExpression ) | ( '/' UnaryExpression ) )* )*
[117] MultiplicativeExpression ::= UnaryExpression ( '*' UnaryExpression | '/' UnaryExpression )*
[118] UnaryExpression ::= '!' PrimaryExpression
| '+' PrimaryExpression
| '-' PrimaryExpression
| PrimaryExpression
[119] PrimaryExpression ::= BrackettedExpression | BuiltInCall | iriOrFunction | RDFLiteral | NumericLiteral | BooleanLiteral | Var
[120] BrackettedExpression ::= '(' Expression ')'
[121] BuiltInCall ::= Aggregate
| 'STR' '(' Expression ')'
| 'LANG' '(' Expression ')'
| 'LANGMATCHES' '(' Expression ',' Expression ')'
| 'DATATYPE' '(' Expression ')'
| 'BOUND' '(' Var ')'
| 'IRI' '(' Expression ')'
| 'IRI' '(' Expression ')'
| 'BNODE' ( '(' Expression ')' | NIL )
| 'RAND' NIL
| 'ABS' '(' Expression ')'
| 'CEIL' '(' Expression ')'
| 'FLOOR' '(' Expression ')'
| 'ROUND' '(' Expression ')'
| 'CONCAT' ExpressionList
| SubstringExpression
| 'STRLEN' '(' Expression ')'
| StrReplaceExpression
| 'UCASE' '(' Expression ')'
| 'LCASE' '(' Expression ')'
| 'ENCODE_FOR_IRI' '(' Expression ')'
| 'CONTAINS' '(' Expression ',' Expression ')'
| 'STRSTARTS' '(' Expression ',' Expression ')'
| 'STRENDS' '(' Expression ',' Expression ')'
| 'STRBEFORE' '(' Expression ',' Expression ')'
| 'STRAFTER' '(' Expression ',' Expression ')'
| 'YEAR' '(' Expression ')'
| 'MONTH' '(' Expression ')'
| 'DAY' '(' Expression ')'
| 'HOURS' '(' Expression ')'
| 'MINUTES' '(' Expression ')'
| 'SECONDS' '(' Expression ')'
| 'TIMEZONE' '(' Expression ')'
| 'TZ' '(' Expression ')'
| 'NOW' NIL
| 'UUID' NIL
| 'STRUUID' NIL
| 'MD5' '(' Expression ')'
| 'SHA1' '(' Expression ')'
| 'SHA256' '(' Expression ')'
| 'SHA384' '(' Expression ')'
| 'SHA512' '(' Expression ')'
| 'COALESCE' ExpressionList
| 'IF' '(' Expression ',' Expression ',' Expression ')'
| 'STRLANG' '(' Expression ',' Expression ')'
| 'STRDT' '(' Expression ',' Expression ')'
| 'sameTerm' '(' Expression ',' Expression ')'
| 'isIRI' '(' Expression ')'
| 'isIRI' '(' Expression ')'
| 'isBLANK' '(' Expression ')'
| 'isLITERAL' '(' Expression ')'
| 'isNUMERIC' '(' Expression ')'
| RegexExpression
| ExistsFunc
| NotExistsFunc
[122] RegexExpression ::= 'REGEX' '(' Expression ',' Expression ( ',' Expression )? ')'
[123] SubstringExpression ::= 'SUBSTR' '(' Expression ',' Expression ( ',' Expression )? ')'
[124] StrReplaceExpression ::= 'REPLACE' '(' Expression ',' Expression ',' Expression ( ',' Expression )? ')'
[125] ExistsFunc ::= 'EXISTS' GroupGraphPattern
[126] NotExistsFunc ::= 'NOT' 'EXISTS' GroupGraphPattern
[127] Aggregate ::= 'COUNT' '(' 'DISTINCT'? ( '*' | Expression ) ')'
| 'SUM' '(' 'DISTINCT'? Expression ')'
| 'MIN' '(' 'DISTINCT'? Expression ')'
| 'MAX' '(' 'DISTINCT'? Expression ')'
| 'AVG' '(' 'DISTINCT'? Expression ')'
| 'SAMPLE' '(' 'DISTINCT'? Expression ')'
| 'GROUP_CONCAT' '(' 'DISTINCT'? Expression ( ';' 'SEPARATOR' '=' String )? ')'
[128] iriOrFunction ::= iri ArgList?
[129] RDFLiteral ::= String ( LANGTAG | ( '^^' iri ) )?
[130] NumericLiteral ::= NumericLiteralUnsigned | NumericLiteralPositive | NumericLiteralNegative
[131] NumericLiteralUnsigned ::= INTEGER | DECIMAL | DOUBLE
[132] NumericLiteralPositive ::= INTEGER_POSITIVE | DECIMAL_POSITIVE | DOUBLE_POSITIVE
[133] NumericLiteralNegative ::= INTEGER_NEGATIVE | DECIMAL_NEGATIVE | DOUBLE_NEGATIVE
[134] BooleanLiteral ::= 'true' | 'false'
[135] String ::= STRING_LITERAL1 | STRING_LITERAL2 | STRING_LITERAL_LONG1 | STRING_LITERAL_LONG2
[136] iri ::= IRIREF | PrefixedName
[137] PrefixedName ::= PNAME_LN | PNAME_NS
[138] BlankNode ::= BLANK_NODE_LABEL | ANON
最后来到最基本的词法。SPARQL的单词可分为:
- IRI。
- 由
<
与>
包围的IRI。对于相对IRI,会基于BaseDecl
中给出的IRI解析。 - 形如
名字空间:本地名
,表示由名字空间表示的前缀后紧接本地名,两个部分都可以省略。
- 由
- 字面值。用成对的
'
、"
、'''
或"""
包围的文本,其中可用常见的反斜杠转义序列。字符串后可接@
和语言标签来指定语言,或者接^^
和用IRI表示的数据类型。对于一些常见类型有缩写,比如:1
相当于"1"^^xsd:integer
1.3
相当于"1.3"^^xsd:decimal
1.0e6
相当于"1.0e6"^^xsd:double
true
相当于"true"^^xsd:boolean
false
相当于"false"^^xsd:boolean
- 空白结点。形如
_:标签
或[]
,不同标签空白结点类似于不同的变量,可用于匹配但不需要引用值的场合。 - 变量。由
$
或?
开首接变量名组成,其中变量名用一般非特殊字符都可以,至少字母、数字、下划线总是容许的,包括多数Unicode字符。 - 行末注释由
#
开始,会被忽略。
[139] IRIREF ::= '<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
[140] PNAME_NS ::= PN_PREFIX? ':'
[141] PNAME_LN ::= PNAME_NS PN_LOCAL
[142] BLANK_NODE_LABEL ::= '_:' ( PN_CHARS_U | [0-9] ) ((PN_CHARS|'.')* PN_CHARS)?
[143] VAR1 ::= '?' VARNAME
[144] VAR2 ::= '$' VARNAME
[145] LANGTAG ::= '@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
[146] INTEGER ::= [0-9]+
[147] DECIMAL ::= [0-9]* '.' [0-9]+
[148] DOUBLE ::= [0-9]+ '.' [0-9]* EXPONENT | '.' ([0-9])+ EXPONENT | ([0-9])+ EXPONENT
[149] INTEGER_POSITIVE ::= '+' INTEGER
[150] DECIMAL_POSITIVE ::= '+' DECIMAL
[151] DOUBLE_POSITIVE ::= '+' DOUBLE
[152] INTEGER_NEGATIVE ::= '-' INTEGER
[153] DECIMAL_NEGATIVE ::= '-' DECIMAL
[154] DOUBLE_NEGATIVE ::= '-' DOUBLE
[155] EXPONENT ::= [eE] [+-]? [0-9]+
[156] STRING_LITERAL1 ::= "'" ( ([^#x27#x5C#xA#xD]) | ECHAR )* "'"
[157] STRING_LITERAL2 ::= '"' ( ([^#x22#x5C#xA#xD]) | ECHAR )* '"'
[158] STRING_LITERAL_LONG1 ::= "'''" ( ( "'" | "''" )? ( [^'\] | ECHAR ) )* "'''"
[159] STRING_LITERAL_LONG2 ::= '"""' ( ( '"' | '""' )? ( [^"\] | ECHAR ) )* '"""'
[160] ECHAR ::= '\' [tbnrf\"']
[161] NIL ::= '(' WS* ')'
[162] WS ::= #x20 | #x9 | #xD | #xA
[163] ANON ::= '[' WS* ']'
[164] PN_CHARS_BASE ::= [A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[165] PN_CHARS_U ::= PN_CHARS_BASE | '_'
[166] VARNAME ::= ( PN_CHARS_U | [0-9] ) ( PN_CHARS_U | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040] )*
[167] PN_CHARS ::= PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
[168] PN_PREFIX ::= PN_CHARS_BASE ((PN_CHARS|'.')* PN_CHARS)?
[169] PN_LOCAL ::= (PN_CHARS_U | ':' | [0-9] | PLX ) ((PN_CHARS | '.' | ':' | PLX)* (PN_CHARS | ':' | PLX) )?
[170] PLX ::= PERCENT | PN_LOCAL_ESC
[171] PERCENT ::= '%' HEX HEX
[172] HEX ::= [0-9] | [A-F] | [a-f]
[173] PN_LOCAL_ESC ::= '\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | '/' | '?' | '#' | '@' | '%' )
其实,SPARQL的完整语法还支持修改数据:
[3] UpdateUnit ::= Update
[29] Update ::= Prologue ( Update1 ( ';' Update )? )?
[30] Update1 ::= Load | Clear | Drop | Add | Move | Copy | Create | InsertData | DeleteData | DeleteWhere | Modify
[31] Load ::= 'LOAD' 'SILENT'? iri ( 'INTO' GraphRef )?
[32] Clear ::= 'CLEAR' 'SILENT'? GraphRefAll
[33] Drop ::= 'DROP' 'SILENT'? GraphRefAll
[34] Create ::= 'CREATE' 'SILENT'? GraphRef
[35] Add ::= 'ADD' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[36] Move ::= 'MOVE' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[37] Copy ::= 'COPY' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[38] InsertData ::= 'INSERT DATA' QuadData
[39] DeleteData ::= 'DELETE DATA' QuadData
[40] DeleteWhere ::= 'DELETE WHERE' QuadPattern
[41] Modify ::= ( 'WITH' iri )? ( DeleteClause InsertClause? | InsertClause ) UsingClause* 'WHERE' GroupGraphPattern
[42] DeleteClause ::= 'DELETE' QuadPattern
[43] InsertClause ::= 'INSERT' QuadPattern
[44] UsingClause ::= 'USING' ( iri | 'NAMED' iri )
[45] GraphOrDefault ::= 'DEFAULT' | 'GRAPH'? iri
[46] GraphRef ::= 'GRAPH' iri
[47] GraphRefAll ::= GraphRef | 'DEFAULT' | 'NAMED' | 'ALL'
[48] QuadPattern ::= '{' Quads '}'
[49] QuadData ::= '{' Quads '}'
[50] Quads ::= TriplesTemplate? ( QuadsNotTriples '.'? TriplesTemplate? )*
[51] QuadsNotTriples ::= 'GRAPH' VarOrIri '{' TriplesTemplate? '}'
结语
Wikidata作为一个协作项目,我们在从中获益的同时也理应积极向它捐赠数据,成为一个负责任的公民。例如开发基于它的应用时不妨加入容许用户纠错的机制,并把经审查的纠错信息反馈给上游。