陈颂光
全栈工程师,能够独立开发从解释器到网站和桌面/移动端应用的各类软件。
关注我的 GitHub

善用Wikidata语义网

人们用他们熟悉的自然语言如中文在网上编辑了海量的信息,但自然语言固有的极端复杂性使这些文本难以为机器所理解,于是人们因为找不到已有的信息而重复劳动,这个恶性循环的结果是互联网上文章泛滥但质量普遍低下,特别是过时信息得不到清理。相反,语义网提倡用更规范的方式表达知识,并把它们连接起来。维基基金会的Wikidata就是一个人人可编辑和利用的知识库,有大量志愿者持续扩展和修正它,这使它的数据无论在数量上还是质量上都处于领先地位,维基百科中条目右上方卡片的信息就来自Wikidata。值得一提的是Wikidata的数据位于公众领域,可以(接近)无条件地合法使用。

探索语义网

世界上有许多实体,其中包括具体的对象如“铅笔”和抽象的概念如“实数”。在自然语言中实体用名词表示,在语义网中我们用IRI表示实体。在自然语言中实体与名词的关系是多对多的(比如一个人可以有多个绰号,不同人也可以同名同姓),而在语义网中实体与IRI的关系是一对一的,这使得语义网中可以避开通常句子中的歧义问题。同时,使用IRI表示实体让我们可方便地获取关于实体的有用信息:把它当作URL去访问即可。

断言指出实体的单个特点,大致对应于自然语言中最简单的陈述句。自然语言中的陈述句往往相当复杂,同时表达了多种含义,但语义网中的每个断言应该只表达一项单一的知识。语义网中最常见的断言由一个主体、一个属性谓词和一个客体组成,多数情况下它们都是实体,但也可能是数值、字符串等等。

语义网的基本模型就这么简单,让我们来体验一下Wikidata如何体现语义网的原则。在浏览器打开一个实体IRI ,比如说http://www.wikidata.org/entity/Q753535,其中以Q开始的是Wikidata中实体的编号:

Wikidata的一个实体页面

我们可以看到以下信息:

  • 这个实体在不同语言中的惟一最常见名称,如在中文是“金瓶梅”。
  • 这个实体在不同语言中的各个别名,如在英文有“The Plum in the Golden Vase”、“.The Golden Lotus”、“Plum in the Golden Vase”、“Golden Lotus”和“Jinpingmei”。
  • 这个实体在不同语言中的补充描述(用于区分自然语言中同名的实体),如在中文是“中国古典小说”。
  • 其它以这实体为主体的断言(不同实体有不同的属性),比如说由此我们可知:
    • 这个实体是一本书
    • 这个实体的某一张图
    • 这个实体始于1610年
    • 这个实体源于中国
    • 这个实体在维基百科的页面

我们看到,谓词和大部分客体都是实体,在上述页面中我们可以通过链接继续浏览它们,例如我们可进一步了解作者“兰陵笑笑生”,相关实体之间的断言使Wikidata成为真正的网(用图论的话,实体对应顶点而断言对应边),我们可沿这个网逐渐展开知识。部分断言更带有出处,方便进行考证。

当然在浏览语义网前要知道实体的IRI,幸运的是Wikidata网站右上角的搜索框可以按实体名称或别名搜索实体。

HTML格式虽然方便人们用浏览器查看,但用程序处理时类似JSON的格式更方便且节省流量,这时可以改用https://www.wikidata.org/wiki/Special:EntityData/Q753535.json,类似地按要求的格式可改用rdfttlnl后缀,以下演示返回结果的结构(但内容经过大幅删减):

{
    "entities": {
        "Q753535": {
            "pageid": 708936,
            "ns": 0,
            "title": "Q753535",
            "lastrevid": 805212899,
            "modified": "2018-12-05T03:30:00Z",
            "type": "item",
            "id": "Q753535",
            "labels": {
                "zh-hans": {
                    "language": "zh-hans",
                    "value": "金瓶梅"
                },
                "zh-hant": {
                    "language": "zh-hant",
                    "value": "金瓶梅"
                },
                "zh-hk": {
                    "language": "zh-hk",
                    "value": "金瓶梅"
                },
                "en": {
                    "language": "en",
                    "value": "Jin Ping Mei"
                },
                "zh": {
                    "language": "zh",
                    "value": "金瓶梅"
                }
            },
            "descriptions": {
                "en": {
                    "language": "en",
                    "value": "Chinese naturalistic novel"
                },
                "zh": {
                    "language": "zh",
                    "value": "中国古典小说"
                }
            },
            "aliases": {
                "zh": [{
                    "language": "zh",
                    "value": "金瓶梅词话"
                }],
                "en": [{
                    "language": "en",
                    "value": "The Plum in the Golden Vase"
                }, {
                    "language": "en",
                    "value": ".The Golden Lotus"
                }, {
                    "language": "en",
                    "value": "Plum in the Golden Vase"
                }, {
                    "language": "en",
                    "value": "Golden Lotus"
                }, {
                    "language": "en",
                    "value": "Jinpingmei"
                }]
            },
            "claims": {
                "P50": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P50",
                        "datavalue": {
                            "value": {
                                "entity-type": "item",
                                "numeric-id": 832785,
                                "id": "Q832785"
                            },
                            "type": "wikibase-entityid"
                        },
                        "datatype": "wikibase-item"
                    },
                    "type": "statement",
                    "id": "Q753535$fe287f80-43c8-22f7-a0a0-32315931fa20",
                    "rank": "normal"
                }],
                "P373": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P373",
                        "datavalue": {
                            "value": "Jinpingmei",
                            "type": "string"
                        },
                        "datatype": "string"
                    },
                    "type": "statement",
                    "id": "q753535$7B87F320-CD1A-4A8F-A36D-0F562AFD6550",
                    "rank": "normal",
                    "references": [{
                        "hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
                        "snaks": {
                            "P143": [{
                                "snaktype": "value",
                                "property": "P143",
                                "datavalue": {
                                    "value": {
                                        "entity-type": "item",
                                        "numeric-id": 328,
                                        "id": "Q328"
                                    },
                                    "type": "wikibase-entityid"
                                },
                                "datatype": "wikibase-item"
                            }]
                        },
                        "snaks-order": ["P143"]
                    }]
                }],
                "P31": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P31",
                        "datavalue": {
                            "value": {
                                "entity-type": "item",
                                "numeric-id": 571,
                                "id": "Q571"
                            },
                            "type": "wikibase-entityid"
                        },
                        "datatype": "wikibase-item"
                    },
                    "type": "statement",
                    "id": "q753535$D7105EE3-F496-4639-B951-43C0F52D53BA",
                    "rank": "normal",
                    "references": [{
                        "hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
                        "snaks": {
                            "P143": [{
                                "snaktype": "value",
                                "property": "P143",
                                "datavalue": {
                                    "value": {
                                        "entity-type": "item",
                                        "numeric-id": 328,
                                        "id": "Q328"
                                    },
                                    "type": "wikibase-entityid"
                                },
                                "datatype": "wikibase-item"
                            }]
                        },
                        "snaks-order": ["P143"]
                    }]
                }],
                "P957": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P957",
                        "datavalue": {
                            "value": "2-07-031490-1",
                            "type": "string"
                        },
                        "datatype": "external-id"
                    },
                    "type": "statement",
                    "id": "Q753535$1A821766-DDAF-4185-BB09-FCCEEBA8C0EC",
                    "rank": "normal"
                }],
                "P136": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P136",
                        "datavalue": {
                            "value": {
                                "entity-type": "item",
                                "numeric-id": 8261,
                                "id": "Q8261"
                            },
                            "type": "wikibase-entityid"
                        },
                        "datatype": "wikibase-item"
                    },
                    "type": "statement",
                    "id": "Q753535$63866EAC-F73C-4B89-9211-7771FE802D7B",
                    "rank": "normal"
                }, {
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P136",
                        "datavalue": {
                            "value": {
                                "entity-type": "item",
                                "numeric-id": 11452132,
                                "id": "Q11452132"
                            },
                            "type": "wikibase-entityid"
                        },
                        "datatype": "wikibase-item"
                    },
                    "type": "statement",
                    "id": "Q753535$E7AFFA14-BD4A-4F6A-BDE6-1BE66555758E",
                    "rank": "normal",
                    "references": [{
                        "hash": "fa278ebfc458360e5aed63d5058cca83c46134f1",
                        "snaks": {
                            "P143": [{
                                "snaktype": "value",
                                "property": "P143",
                                "datavalue": {
                                    "value": {
                                        "entity-type": "item",
                                        "numeric-id": 328,
                                        "id": "Q328"
                                    },
                                    "type": "wikibase-entityid"
                                },
                                "datatype": "wikibase-item"
                            }]
                        },
                        "snaks-order": ["P143"]
                    }]
                }],
                "P577": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P577",
                        "datavalue": {
                            "value": {
                                "time": "+1610-00-00T00:00:00Z",
                                "timezone": 0,
                                "before": 0,
                                "after": 0,
                                "precision": 9,
                                "calendarmodel": "http://www.wikidata.org/entity/Q1985727"
                            },
                            "type": "time"
                        },
                        "datatype": "time"
                    },
                    "type": "statement",
                    "id": "Q753535$b7a4cc71-4f11-0068-df03-a566332ebf45",
                    "rank": "normal"
                }],
                "P18": [{
                    "mainsnak": {
                        "snaktype": "value",
                        "property": "P18",
                        "datavalue": {
                            "value": "Jin Ping Mei.jpg",
                            "type": "string"
                        },
                        "datatype": "commonsMedia"
                    },
                    "type": "statement",
                    "id": "Q753535$B4550D8C-62D4-4868-8020-23973E43313A",
                    "rank": "normal",
                    "references": [{
                        "hash": "dbc378ca38aa06d2ac2a1b790ffe3a5ee9d0bd17",
                        "snaks": {
                            "P143": [{
                                "snaktype": "value",
                                "property": "P143",
                                "datavalue": {
                                    "value": {
                                        "entity-type": "item",
                                        "numeric-id": 53464,
                                        "id": "Q53464"
                                    },
                                    "type": "wikibase-entityid"
                                },
                                "datatype": "wikibase-item"
                            }]
                        },
                        "snaks-order": ["P143"]
                    }]
                }]
            },
            "sitelinks": {
                "enwiki": {
                    "site": "enwiki",
                    "title": "Jin Ping Mei",
                    "badges": [],
                    "url": "https://en.wikipedia.org/wiki/Jin_Ping_Mei"
                },
                "enwikiquote": {
                    "site": "enwikiquote",
                    "title": "Jin Ping Mei",
                    "badges": [],
                    "url": "https://en.wikiquote.org/wiki/Jin_Ping_Mei"
                },
                "zhwiki": {
                    "site": "zhwiki",
                    "title": "金瓶梅",
                    "badges": [],
                    "url": "https://zh.wikipedia.org/wiki/%E9%87%91%E7%93%B6%E6%A2%85"
                },
                "zhwikisource": {
                    "site": "zhwikisource",
                    "title": "金瓶梅",
                    "badges": [],
                    "url": "https://zh.wikisource.org/wiki/%E9%87%91%E7%93%B6%E6%A2%85"
                }
            }
        }
    }
}

如果需要更仔细的控制,如只要求部分语言和属性,又或一次性取得多个实体的信息,可以参考Wikidata API的文档,比如说https://www.wikidata.org/w/api.php?action=wbgetentities&ids=Q1&props=labels|aliases|descriptions&languages=en|zh&format=json可取得实体Q1在中文和英文的名称、别名和描述:

{
    "entities": {
        "Q1": {
            "type": "item",
            "id": "Q1",
            "labels": {
                "zh": {
                    "language": "zh",
                    "value": "\u5b87\u5b99"
                },
                "en": {
                    "language": "en",
                    "value": "Universe"
                }
            },
            "descriptions": {
                "zh": {
                    "language": "zh",
                    "value": "\u4e00\u5207\u7a7a\u95f4\u3001\u65f6\u95f4\u3001\u7269\u8d28\u548c\u80fd\u91cf\u6784\u6210\u7684\u603b\u4f53"
                },
                "en": {
                    "language": "en",
                    "value": "totality of space and all contents"
                }
            },
            "aliases": {
                "en": [
                    {
                        "language": "en",
                        "value": "Our Universe"
                    },
                    {
                        "language": "en",
                        "value": "The Universe"
                    },
                    {
                        "language": "en",
                        "value": "Universe (Ours)"
                    },
                    {
                        "language": "en",
                        "value": "The Cosmos"
                    },
                    {
                        "language": "en",
                        "value": "cosmos"
                    }
                ]
            }
        }
    },
    "success": 1
}

同样地,搜索也可用API,如https://www.wikidata.org/w/api.php?action=wbsearchentities&search=apache&language=en&limit=2&format=json可搜索英文词”apache“并返回最多两个结果:

{
    "searchinfo": {
        "search": "apache"
    },
    "search": [
        {
            "repository": "",
            "id": "Q102090",
            "conceptIRI": "http://www.wikidata.org/entity/Q102090",
            "title": "Q102090",
            "pageid": 104694,
            "url": "//www.wikidata.org/wiki/Q102090",
            "label": "Apache",
            "description": "several culturally related groups of Native Americans in the United States",
            "match": {
                "type": "label",
                "language": "en",
                "text": "Apache"
            }
        },
        {
            "repository": "",
            "id": "Q11354",
            "conceptIRI": "http://www.wikidata.org/entity/Q11354",
            "title": "Q11354",
            "pageid": 12854,
            "url": "//www.wikidata.org/wiki/Q11354",
            "label": "Apache HTTP Server",
            "description": "open-source web server software",
            "match": {
                "type": "label",
                "language": "en",
                "text": "Apache HTTP Server"
            }
        }
    ],
    "search-continue": 2,
    "success": 1
}

发掘语义网

上面介绍了获取关于一个实体的信息的方法,但现实中我们通常想要的是关于满足一定条件的所有实体的信息。虽然把Wikidata中所有数据从https://dumps.wikimedia.org/wikidatawiki/entities/下载下来是可能的,但下载至少几十GB的压缩包下来本地处理对于许多用途而言太笨重了,而且如果不定期与上游同步(如下载最近30天的变更)数据就会开始变得过时。为此,Wikidata提供了一个查询服务https://query.wikidata.org/,大家可以在浏览器使用其IDE编辑查询,也可以写程序向URLhttps://query.wikidata.org/sparql?query=经编码的查询发送GETPOST请求,返回结果的格式可用HTTPAccept: 的值application/sparql-results+xmlapplication/sparql-results+jsontext/tab-separated-valuestext/csvapplication/x-binary-rdf-results-table决定。

这个查询服务接受的查询语言是SPARQL,与受Prolog启发的SQL高度类似。为了尽早熟悉这语言,不妨先尝试https://query.wikidata.org/页面提供的示例,然后可以尝试调整过滤器看看结果有什么变化。接着可以从空白查询出发通过新增过滤器建立查询,例如先输入”黄霑“选取Q709317作为客体,再选择属性作词者(P676)就会得到查询(含义是“哪些歌的作词人是黄霑?”):

SELECT ?__ ?__Label WHERE {
  SERVICE wikibase:label { bd:serviceParam wikibase:language "[AUTO_LANGUAGE],en". }
  ?__ wdt:P676 wd:Q709317.
}
LIMIT 100

提交查询就会得到结果如:

__ __Label
wd:Q1743796 我的中國心
wd:Q10867969 上海灘
wd:Q15909571 獅子山下

SPARQL

以下我们说明怎么写SPARQL查询。

简单查询

如前所述,一个断言本质上是由主体、属性谓词和客体组成的三元组,它们三者通常都是用IRI表示的实体,因而一个典型的断言形如http://www.wikidata.org/entity/Q444381 http://www.wikidata.org/prop/direct/P50 http://www.wikidata.org/entity/Q23114(这里http://www.wikidata.org/entity/Q444381表示《孔乙己》、http://www.wikidata.org/prop/direct/P50表示属性“作者是”、http://www.wikidata.org/entity/Q23114表示鲁迅,因而这断言的意思是“《孔乙己》的作者是鲁迅”)。由于每次都使用完整的IRI太冗长,我们通常把常用的前缀如http://www.wikidata.org/entity/http://www.wikidata.org/prop/direct/分别简记为wd:wdt:,于是上述断言可以简化为wd:Q444381 wdt:P50 wd:Q23114。当然,我们也可以自定义其它名称空间,如在查询前加上PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>后查询中就可用rdfs:代替IRI前缀http://www.w3.org/2000/01/rdf-schema#

现在我们想在已知断言的一部分的前提下找出断言的其余部分,于是我们把断言中未知的部分用?变量名代替(变量名是我们选取的,不同变量名代表不同的未知量,如果不关心变量值也可用_:变量名),然后在前面加上SELECT * WHERE{而在后面加上}就得到查询,比如说:

查询 效果
SELECT * WHERE {?work wdt:P50 wd:Q23114} 查找鲁迅有哪些作品
SELECT * WHERE {wd:Q444381 ?relationship wd:Q23114} 查找《孔乙己》与鲁迅有什么关系
SELECT * WHERE {wd:Q444381 wdt:P50 ?author} 查找《孔乙己》的作者是谁
SELECT * WHERE {?work wdt:P50 ?author} 查找所有作品和它们分别的作者是谁

尝试这些查询会分别得到一个表格,表格的各个列分别表示各未知变量的值而各行表示不同断言。不过,由于表格中的值都是实体的IRI,我们并不知道它们分别代表什么,我们更想要的是实体的名称。由于实体到它的名称通过属性谓词rdfs:label给出(顺便一提,别名由skos:altLabel给出、描述由schema:description给出),我们把查询修改为SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName}(这里两个元组用句点.分隔)。读者可能已经看出这相当于Prolog中的合一或SQL中的表连接,现在我们得到了鲁迅各作品的IRI和对应的名称。

不过,我们注意到同一作品在不同语言中名称都在不同行出来了,而我们只对中文名称感兴趣。由于语义网中的字符串是带语言标签的(如英文字符串表示为类似"Hello world"@en、繁体中文字符串表示为类似"軟體"@zh-hant),于是我们把查询改为SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh')}

假如我们还想要作品的英文名称,我们自然会写出SELECT * WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh'). ?work rdfs:label ?workNameEn. FILTER(LANG(?workNameEn)='en')},但你会看到结果变少了,原因是许多作品没有英文名。如果我们只要求在有英文名时返回英文名,则可把查询改为:

SELECT * WHERE {
  ?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. 
  FILTER(LANG(?workName)='zh').
  OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
}

假如我们对实体IRI完全不感兴趣,可以把查询改为SELECT ?workName WHERE {?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. FILTER(LANG(?workName)='zh')},这样结果中只会有?workName一列,就像SQL中的投影。顺便一提,连接再投影后可能有重复结果,例如用SELECT ?coAuthorName WHERE {?work wdt:P50 wd:Q173746.?work wdt:P50 ?coAuthor.FILTER(?coAuthor != wd:Q173746)?coAuthor rdfs:label ?coAuthorName.FILTER(LANG(?coAuthorName)='en')}查询埃尔德什的合著者英文姓名的名单时,结果中有重复是由于有人与他合作多次的结果。要消除重复只用在SELECT后加上DISTINCT即可(如果我们只容许但不要求消除重复,则改用REDUCED)。

反过来,我们可能想要在结果可增加列,新列的值由原有列的值计算得到。例如我们可能想把英文标题变成大写:

SELECT ?workName ?workNameEnCap WHERE {
  ?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. 
  FILTER(LANG(?workName)='zh').
  OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
  BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}

一般来说结果集是无序的,但我们可以要求按变量值或更一般表达式的值排序,例如以下按英文名倒序排(排序则把DESC改为ASC):

SELECT ?workName ?workNameEnCap WHERE {
  ?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. 
  FILTER(LANG(?workName)='zh').
  OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
  BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}ORDER BY DESC(?workNameEnCap)

有时候结果多了看不过来,这就需要对结果集分页。如只要求返回第10个结果开始的5个结果:

SELECT ?workName ?workNameEnCap WHERE {
  ?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName. 
  FILTER(LANG(?workName)='zh').
  OPTIONAL{?work rdfs:label ?workNameEn.FILTER(LANG(?workNameEn)='en')}
  BIND(UCASE(?workNameEn) AS ?workNameEnCap)
}ORDER BY DESC(?workNameEnCap)
OFFSET 10 LIMIT 5

SPARQL还提供了一些语法糖让查询写起来更紧凑一点,以下举例说明:

片段 简化 附注
?work wdt:P50 wd:Q23114. ?work rdfs:label ?workName ?work wdt:P50 wd:Q23114; rdfs:label ?workName 相同主体的三元组
?work rdfs:label ?workName. ?work rdfs:label ?workNameEn ?work rdfs:label ?workName,?workNameEn 相同主体和谓词的元组
wd:Q444381 wdt:P50 _:author._:author wdt:P1412 ?language wd:Q444381 wdt:P50/wdt:P1412 ?language 间接关系

聚合查询

与SQL类似,SPARQL也有聚合子句。比如我们想查询鲁迅不同年份的作品数,可以使用以下查询:

SELECT ?year (COUNT(?work) AS ?count) WHERE {
  ?work wdt:P50 wd:Q23114. 
  ?work wdt:P577 ?date. 
}GROUP BY (YEAR(?date) AS ?year)
ORDER BY ?year

我们还可以过滤分组,比如只要作品数多于一的年份:

SELECT ?year (COUNT(?work) AS ?count) WHERE {
  ?work wdt:P50 wd:Q23114. 
  ?work wdt:P577 ?date. 
}GROUP BY (YEAR(?date) AS ?year)
HAVING (COUNT(?work)>1)
ORDER BY ?year

除了COUNT外,还有MINMAXAVGSAMPLEGROUP_CONCAT可作聚合函数,需要去重复再聚合的话可以在(后加上DISTINCT

完整的查询语法

有了前面的例子,我们已经了解SPARQL中最常用的SELECT语句的结构,另外还有用于判断结果集是否非空但不返回结果集的ASK语句、用于用结果集构造图的CONSTRUCT语句和返回实体全部属性的DESCRIBE语句。以下给出形式化的定义(关键词除用于表示类型谓词的a外都不区分大小写):

[1]  	QueryUnit	  ::=  	Query
[2]  	Query	  ::=  	Prologue
( SelectQuery | ConstructQuery | DescribeQuery | AskQuery )
ValuesClause
[4]  	Prologue	  ::=  	( BaseDecl | PrefixDecl )*
[5]  	BaseDecl	  ::=  	'BASE' IRIREF
[6]  	PrefixDecl	  ::=  	'PREFIX' PNAME_NS IRIREF
[7]  	SelectQuery	  ::=  	SelectClause DatasetClause* WhereClause SolutionModifier
[8]  	SubSelect	  ::=  	SelectClause WhereClause SolutionModifier ValuesClause
[9]  	SelectClause	  ::=  	'SELECT' ( 'DISTINCT' | 'REDUCED' )? ( ( Var | ( '(' Expression 'AS' Var ')' ) )+ | '*' )
[10]  	ConstructQuery	  ::=  	'CONSTRUCT' ( ConstructTemplate DatasetClause* WhereClause SolutionModifier | DatasetClause* 'WHERE' '{' TriplesTemplate? '}' SolutionModifier )
[11]  	DescribeQuery	  ::=  	'DESCRIBE' ( VarOrIri+ | '*' ) DatasetClause* WhereClause? SolutionModifier
[12]  	AskQuery	  ::=  	'ASK' DatasetClause* WhereClause SolutionModifier
[13]  	DatasetClause	  ::=  	'FROM' ( DefaultGraphClause | NamedGraphClause )
[14]  	DefaultGraphClause	  ::=  	SourceSelector
[15]  	NamedGraphClause	  ::=  	'NAMED' SourceSelector
[16]  	SourceSelector	  ::=  	iri
[17]  	WhereClause	  ::=  	'WHERE'? GroupGraphPattern
[18]  	SolutionModifier	  ::=  	GroupClause? HavingClause? OrderClause? LimitOffsetClauses?
[19]  	GroupClause	  ::=  	'GROUP' 'BY' GroupCondition+
[20]  	GroupCondition	  ::=  	BuiltInCall | FunctionCall | '(' Expression ( 'AS' Var )? ')' | Var
[21]  	HavingClause	  ::=  	'HAVING' HavingCondition+
[22]  	HavingCondition	  ::=  	Constraint
[23]  	OrderClause	  ::=  	'ORDER' 'BY' OrderCondition+
[24]  	OrderCondition	  ::=  	( ( 'ASC' | 'DESC' ) BrackettedExpression )
| ( Constraint | Var )
[25]  	LimitOffsetClauses	  ::=  	LimitClause OffsetClause? | OffsetClause LimitClause?
[26]  	LimitClause	  ::=  	'LIMIT' INTEGER
[27]  	OffsetClause	  ::=  	'OFFSET' INTEGER
[28]  	ValuesClause	  ::=  	( 'VALUES' DataBlock )?
[52]  	TriplesTemplate	  ::=  	TriplesSameSubject ( '.' TriplesTemplate? )?

其中用于过滤断言的WHERE子句的组成如下:

[53]  	GroupGraphPattern	  ::=  	'{' ( SubSelect | GroupGraphPatternSub ) '}'
[54]  	GroupGraphPatternSub	  ::=  	TriplesBlock? ( GraphPatternNotTriples '.'? TriplesBlock? )*
[55]  	TriplesBlock	  ::=  	TriplesSameSubjectPath ( '.' TriplesBlock? )?
[56]  	GraphPatternNotTriples	  ::=  	GroupOrUnionGraphPattern | OptionalGraphPattern | MinusGraphPattern | GraphGraphPattern | ServiceGraphPattern | Filter | Bind | InlineData
[57]  	OptionalGraphPattern	  ::=  	'OPTIONAL' GroupGraphPattern
[58]  	GraphGraphPattern	  ::=  	'GRAPH' VarOrIri GroupGraphPattern
[59]  	ServiceGraphPattern	  ::=  	'SERVICE' 'SILENT'? VarOrIri GroupGraphPattern
[60]  	Bind	  ::=  	'BIND' '(' Expression 'AS' Var ')'
[61]  	InlineData	  ::=  	'VALUES' DataBlock
[62]  	DataBlock	  ::=  	InlineDataOneVar | InlineDataFull
[63]  	InlineDataOneVar	  ::=  	Var '{' DataBlockValue* '}'
[64]  	InlineDataFull	  ::=  	( NIL | '(' Var* ')' ) '{' ( '(' DataBlockValue* ')' | NIL )* '}'
[65]  	DataBlockValue	  ::=  	iri | RDFLiteral | NumericLiteral | BooleanLiteral | 'UNDEF'
[66]  	MinusGraphPattern	  ::=  	'MINUS' GroupGraphPattern
[67]  	GroupOrUnionGraphPattern	  ::=  	GroupGraphPattern ( 'UNION' GroupGraphPattern )*
[68]  	Filter	  ::=  	'FILTER' Constraint
[69]  	Constraint	  ::=  	BrackettedExpression | BuiltInCall | FunctionCall
[70]  	FunctionCall	  ::=  	iri ArgList
[71]  	ArgList	  ::=  	NIL | '(' 'DISTINCT'? Expression ( ',' Expression )* ')'
[72]  	ExpressionList	  ::=  	NIL | '(' Expression ( ',' Expression )* ')'
[73]  	ConstructTemplate	  ::=  	'{' ConstructTriples? '}'
[74]  	ConstructTriples	  ::=  	TriplesSameSubject ( '.' ConstructTriples? )?
[75]  	TriplesSameSubject	  ::=  	VarOrTerm PropertyListNotEmpty | TriplesNode PropertyList
[76]  	PropertyList	  ::=  	PropertyListNotEmpty?
[77]  	PropertyListNotEmpty	  ::=  	Verb ObjectList ( ';' ( Verb ObjectList )? )*
[78]  	Verb	  ::=  	VarOrIri | 'a'
[79]  	ObjectList	  ::=  	Object ( ',' Object )*
[80]  	Object	  ::=  	GraphNode
[81]  	TriplesSameSubjectPath	  ::=  	VarOrTerm PropertyListPathNotEmpty | TriplesNodePath PropertyListPath
[82]  	PropertyListPath	  ::=  	PropertyListPathNotEmpty?
[83]  	PropertyListPathNotEmpty	  ::=  	( VerbPath | VerbSimple ) ObjectListPath ( ';' ( ( VerbPath | VerbSimple ) ObjectList )? )*
[84]  	VerbPath	  ::=  	Path
[85]  	VerbSimple	  ::=  	Var
[86]  	ObjectListPath	  ::=  	ObjectPath ( ',' ObjectPath )*
[87]  	ObjectPath	  ::=  	GraphNodePath
[88]  	Path	  ::=  	PathAlternative
[89]  	PathAlternative	  ::=  	PathSequence ( '|' PathSequence )*
[90]  	PathSequence	  ::=  	PathEltOrInverse ( '/' PathEltOrInverse )*
[91]  	PathElt	  ::=  	PathPrimary PathMod?
[92]  	PathEltOrInverse	  ::=  	PathElt | '^' PathElt
[93]  	PathMod	  ::=  	'?' | '*' | '+'
[94]  	PathPrimary	  ::=  	iri | 'a' | '!' PathNegatedPropertySet | '(' Path ')'
[95]  	PathNegatedPropertySet	  ::=  	PathOneInPropertySet | '(' ( PathOneInPropertySet ( '|' PathOneInPropertySet )* )? ')'
[96]  	PathOneInPropertySet	  ::=  	iri | 'a' | '^' ( iri | 'a' )
[97]  	Integer	  ::=  	INTEGER
[98]  	TriplesNode	  ::=  	Collection | BlankNodePropertyList
[99]  	BlankNodePropertyList	  ::=  	'[' PropertyListNotEmpty ']'
[100]  	TriplesNodePath	  ::=  	CollectionPath | BlankNodePropertyListPath
[101]  	BlankNodePropertyListPath	  ::=  	'[' PropertyListPathNotEmpty ']'
[102]  	Collection	  ::=  	'(' GraphNode+ ')'
[103]  	CollectionPath	  ::=  	'(' GraphNodePath+ ')'
[104]  	GraphNode	  ::=  	VarOrTerm | TriplesNode
[105]  	GraphNodePath	  ::=  	VarOrTerm | TriplesNodePath
[106]  	VarOrTerm	  ::=  	Var | GraphTerm
[107]  	VarOrIri	  ::=  	Var | iri
[108]  	Var	  ::=  	VAR1 | VAR2
[109]  	GraphTerm	  ::=  	iri | RDFLiteral | NumericLiteral | BooleanLiteral | BlankNode | NIL

接下来是表达式语法,虽然看起来可能有点复杂,但由于语法和语义上都与其它常见语言差不多,没有必要作太多讲解。

[110]  	Expression	  ::=  	ConditionalOrExpression
[111]  	ConditionalOrExpression	  ::=  	ConditionalAndExpression ( '||' ConditionalAndExpression )*
[112]  	ConditionalAndExpression	  ::=  	ValueLogical ( '&&' ValueLogical )*
[113]  	ValueLogical	  ::=  	RelationalExpression
[114]  	RelationalExpression	  ::=  	NumericExpression ( '=' NumericExpression | '!=' NumericExpression | '<' NumericExpression | '>' NumericExpression | '<=' NumericExpression | '>=' NumericExpression | 'IN' ExpressionList | 'NOT' 'IN' ExpressionList )?
[115]  	NumericExpression	  ::=  	AdditiveExpression
[116]  	AdditiveExpression	  ::=  	MultiplicativeExpression ( '+' MultiplicativeExpression | '-' MultiplicativeExpression | ( NumericLiteralPositive | NumericLiteralNegative ) ( ( '*' UnaryExpression ) | ( '/' UnaryExpression ) )* )*
[117]  	MultiplicativeExpression	  ::=  	UnaryExpression ( '*' UnaryExpression | '/' UnaryExpression )*
[118]  	UnaryExpression	  ::=  	  '!' PrimaryExpression
| '+' PrimaryExpression
| '-' PrimaryExpression
| PrimaryExpression
[119]  	PrimaryExpression	  ::=  	BrackettedExpression | BuiltInCall | iriOrFunction | RDFLiteral | NumericLiteral | BooleanLiteral | Var
[120]  	BrackettedExpression	  ::=  	'(' Expression ')'
[121]  	BuiltInCall	  ::=  	  Aggregate
| 'STR' '(' Expression ')'
| 'LANG' '(' Expression ')'
| 'LANGMATCHES' '(' Expression ',' Expression ')'
| 'DATATYPE' '(' Expression ')'
| 'BOUND' '(' Var ')'
| 'IRI' '(' Expression ')'
| 'IRI' '(' Expression ')'
| 'BNODE' ( '(' Expression ')' | NIL )
| 'RAND' NIL
| 'ABS' '(' Expression ')'
| 'CEIL' '(' Expression ')'
| 'FLOOR' '(' Expression ')'
| 'ROUND' '(' Expression ')'
| 'CONCAT' ExpressionList
| SubstringExpression
| 'STRLEN' '(' Expression ')'
| StrReplaceExpression
| 'UCASE' '(' Expression ')'
| 'LCASE' '(' Expression ')'
| 'ENCODE_FOR_IRI' '(' Expression ')'
| 'CONTAINS' '(' Expression ',' Expression ')'
| 'STRSTARTS' '(' Expression ',' Expression ')'
| 'STRENDS' '(' Expression ',' Expression ')'
| 'STRBEFORE' '(' Expression ',' Expression ')'
| 'STRAFTER' '(' Expression ',' Expression ')'
| 'YEAR' '(' Expression ')'
| 'MONTH' '(' Expression ')'
| 'DAY' '(' Expression ')'
| 'HOURS' '(' Expression ')'
| 'MINUTES' '(' Expression ')'
| 'SECONDS' '(' Expression ')'
| 'TIMEZONE' '(' Expression ')'
| 'TZ' '(' Expression ')'
| 'NOW' NIL
| 'UUID' NIL
| 'STRUUID' NIL
| 'MD5' '(' Expression ')'
| 'SHA1' '(' Expression ')'
| 'SHA256' '(' Expression ')'
| 'SHA384' '(' Expression ')'
| 'SHA512' '(' Expression ')'
| 'COALESCE' ExpressionList
| 'IF' '(' Expression ',' Expression ',' Expression ')'
| 'STRLANG' '(' Expression ',' Expression ')'
| 'STRDT' '(' Expression ',' Expression ')'
| 'sameTerm' '(' Expression ',' Expression ')'
| 'isIRI' '(' Expression ')'
| 'isIRI' '(' Expression ')'
| 'isBLANK' '(' Expression ')'
| 'isLITERAL' '(' Expression ')'
| 'isNUMERIC' '(' Expression ')'
| RegexExpression
| ExistsFunc
| NotExistsFunc
[122]  	RegexExpression	  ::=  	'REGEX' '(' Expression ',' Expression ( ',' Expression )? ')'
[123]  	SubstringExpression	  ::=  	'SUBSTR' '(' Expression ',' Expression ( ',' Expression )? ')'
[124]  	StrReplaceExpression	  ::=  	'REPLACE' '(' Expression ',' Expression ',' Expression ( ',' Expression )? ')'
[125]  	ExistsFunc	  ::=  	'EXISTS' GroupGraphPattern
[126]  	NotExistsFunc	  ::=  	'NOT' 'EXISTS' GroupGraphPattern
[127]  	Aggregate	  ::=  	  'COUNT' '(' 'DISTINCT'? ( '*' | Expression ) ')'
| 'SUM' '(' 'DISTINCT'? Expression ')'
| 'MIN' '(' 'DISTINCT'? Expression ')'
| 'MAX' '(' 'DISTINCT'? Expression ')'
| 'AVG' '(' 'DISTINCT'? Expression ')'
| 'SAMPLE' '(' 'DISTINCT'? Expression ')'
| 'GROUP_CONCAT' '(' 'DISTINCT'? Expression ( ';' 'SEPARATOR' '=' String )? ')'
[128]  	iriOrFunction	  ::=  	iri ArgList?
[129]  	RDFLiteral	  ::=  	String ( LANGTAG | ( '^^' iri ) )?
[130]  	NumericLiteral	  ::=  	NumericLiteralUnsigned | NumericLiteralPositive | NumericLiteralNegative
[131]  	NumericLiteralUnsigned	  ::=  	INTEGER | DECIMAL | DOUBLE
[132]  	NumericLiteralPositive	  ::=  	INTEGER_POSITIVE | DECIMAL_POSITIVE | DOUBLE_POSITIVE
[133]  	NumericLiteralNegative	  ::=  	INTEGER_NEGATIVE | DECIMAL_NEGATIVE | DOUBLE_NEGATIVE
[134]  	BooleanLiteral	  ::=  	'true' | 'false'
[135]  	String	  ::=  	STRING_LITERAL1 | STRING_LITERAL2 | STRING_LITERAL_LONG1 | STRING_LITERAL_LONG2
[136]  	iri	  ::=  	IRIREF | PrefixedName
[137]  	PrefixedName	  ::=  	PNAME_LN | PNAME_NS
[138]  	BlankNode	  ::=  	BLANK_NODE_LABEL | ANON

最后来到最基本的词法。SPARQL的单词可分为:

  • IRI。
    • <>包围的IRI。对于相对IRI,会基于BaseDecl中给出的IRI解析。
    • 形如名字空间:本地名,表示由名字空间表示的前缀后紧接本地名,两个部分都可以省略。
  • 字面值。用成对的'"'''"""包围的文本,其中可用常见的反斜杠转义序列。字符串后可接@和语言标签来指定语言,或者接^^和用IRI表示的数据类型。对于一些常见类型有缩写,比如:
    • 1相当于"1"^^xsd:integer
    • 1.3相当于"1.3"^^xsd:decimal
    • 1.0e6相当于"1.0e6"^^xsd:double
    • true相当于"true"^^xsd:boolean
    • false相当于"false"^^xsd:boolean
  • 空白结点。形如_:标签[],不同标签空白结点类似于不同的变量,可用于匹配但不需要引用值的场合。
  • 变量。由$?开首接变量名组成,其中变量名用一般非特殊字符都可以,至少字母、数字、下划线总是容许的,包括多数Unicode字符。
  • 行末注释由#开始,会被忽略。
[139]  	IRIREF	  ::=  	'<' ([^<>"{}|^`\]-[#x00-#x20])* '>'
[140]  	PNAME_NS	  ::=  	PN_PREFIX? ':'
[141]  	PNAME_LN	  ::=  	PNAME_NS PN_LOCAL
[142]  	BLANK_NODE_LABEL	  ::=  	'_:' ( PN_CHARS_U | [0-9] ) ((PN_CHARS|'.')* PN_CHARS)?
[143]  	VAR1	  ::=  	'?' VARNAME
[144]  	VAR2	  ::=  	'$' VARNAME
[145]  	LANGTAG	  ::=  	'@' [a-zA-Z]+ ('-' [a-zA-Z0-9]+)*
[146]  	INTEGER	  ::=  	[0-9]+
[147]  	DECIMAL	  ::=  	[0-9]* '.' [0-9]+
[148]  	DOUBLE	  ::=  	[0-9]+ '.' [0-9]* EXPONENT | '.' ([0-9])+ EXPONENT | ([0-9])+ EXPONENT
[149]  	INTEGER_POSITIVE	  ::=  	'+' INTEGER
[150]  	DECIMAL_POSITIVE	  ::=  	'+' DECIMAL
[151]  	DOUBLE_POSITIVE	  ::=  	'+' DOUBLE
[152]  	INTEGER_NEGATIVE	  ::=  	'-' INTEGER
[153]  	DECIMAL_NEGATIVE	  ::=  	'-' DECIMAL
[154]  	DOUBLE_NEGATIVE	  ::=  	'-' DOUBLE
[155]  	EXPONENT	  ::=  	[eE] [+-]? [0-9]+
[156]  	STRING_LITERAL1	  ::=  	"'" ( ([^#x27#x5C#xA#xD]) | ECHAR )* "'"
[157]  	STRING_LITERAL2	  ::=  	'"' ( ([^#x22#x5C#xA#xD]) | ECHAR )* '"'
[158]  	STRING_LITERAL_LONG1	  ::=  	"'''" ( ( "'" | "''" )? ( [^'\] | ECHAR ) )* "'''"
[159]  	STRING_LITERAL_LONG2	  ::=  	'"""' ( ( '"' | '""' )? ( [^"\] | ECHAR ) )* '"""'
[160]  	ECHAR	  ::=  	'\' [tbnrf\"']
[161]  	NIL	  ::=  	'(' WS* ')'
[162]  	WS	  ::=  	#x20 | #x9 | #xD | #xA
[163]  	ANON	  ::=  	'[' WS* ']'
[164]  	PN_CHARS_BASE	  ::=  	[A-Z] | [a-z] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x02FF] | [#x0370-#x037D] | [#x037F-#x1FFF] | [#x200C-#x200D] | [#x2070-#x218F] | [#x2C00-#x2FEF] | [#x3001-#xD7FF] | [#xF900-#xFDCF] | [#xFDF0-#xFFFD] | [#x10000-#xEFFFF]
[165]  	PN_CHARS_U	  ::=  	PN_CHARS_BASE | '_'
[166]  	VARNAME	  ::=  	( PN_CHARS_U | [0-9] ) ( PN_CHARS_U | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040] )*
[167]  	PN_CHARS	  ::=  	PN_CHARS_U | '-' | [0-9] | #x00B7 | [#x0300-#x036F] | [#x203F-#x2040]
[168]  	PN_PREFIX	  ::=  	PN_CHARS_BASE ((PN_CHARS|'.')* PN_CHARS)?
[169]  	PN_LOCAL	  ::=  	(PN_CHARS_U | ':' | [0-9] | PLX ) ((PN_CHARS | '.' | ':' | PLX)* (PN_CHARS | ':' | PLX) )?
[170]  	PLX	  ::=  	PERCENT | PN_LOCAL_ESC
[171]  	PERCENT	  ::=  	'%' HEX HEX
[172]  	HEX	  ::=  	[0-9] | [A-F] | [a-f]
[173]  	PN_LOCAL_ESC	  ::=  	'\' ( '_' | '~' | '.' | '-' | '!' | '$' | '&' | "'" | '(' | ')' | '*' | '+' | ',' | ';' | '=' | '/' | '?' | '#' | '@' | '%' )

其实,SPARQL的完整语法还支持修改数据:

[3]  	UpdateUnit	  ::=  	Update
[29]  	Update	  ::=  	Prologue ( Update1 ( ';' Update )? )?
[30]  	Update1	  ::=  	Load | Clear | Drop | Add | Move | Copy | Create | InsertData | DeleteData | DeleteWhere | Modify
[31]  	Load	  ::=  	'LOAD' 'SILENT'? iri ( 'INTO' GraphRef )?
[32]  	Clear	  ::=  	'CLEAR' 'SILENT'? GraphRefAll
[33]  	Drop	  ::=  	'DROP' 'SILENT'? GraphRefAll
[34]  	Create	  ::=  	'CREATE' 'SILENT'? GraphRef
[35]  	Add	  ::=  	'ADD' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[36]  	Move	  ::=  	'MOVE' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[37]  	Copy	  ::=  	'COPY' 'SILENT'? GraphOrDefault 'TO' GraphOrDefault
[38]  	InsertData	  ::=  	'INSERT DATA' QuadData
[39]  	DeleteData	  ::=  	'DELETE DATA' QuadData
[40]  	DeleteWhere	  ::=  	'DELETE WHERE' QuadPattern
[41]  	Modify	  ::=  	( 'WITH' iri )? ( DeleteClause InsertClause? | InsertClause ) UsingClause* 'WHERE' GroupGraphPattern
[42]  	DeleteClause	  ::=  	'DELETE' QuadPattern
[43]  	InsertClause	  ::=  	'INSERT' QuadPattern
[44]  	UsingClause	  ::=  	'USING' ( iri | 'NAMED' iri )
[45]  	GraphOrDefault	  ::=  	'DEFAULT' | 'GRAPH'? iri
[46]  	GraphRef	  ::=  	'GRAPH' iri
[47]  	GraphRefAll	  ::=  	GraphRef | 'DEFAULT' | 'NAMED' | 'ALL'
[48]  	QuadPattern	  ::=  	'{' Quads '}'
[49]  	QuadData	  ::=  	'{' Quads '}'
[50]  	Quads	  ::=  	TriplesTemplate? ( QuadsNotTriples '.'? TriplesTemplate? )*
[51]  	QuadsNotTriples	  ::=  	'GRAPH' VarOrIri '{' TriplesTemplate? '}'

结语

Wikidata作为一个协作项目,我们在从中获益的同时也理应积极向它捐赠数据,成为一个负责任的公民。例如开发基于它的应用时不妨加入容许用户纠错的机制,并把经审查的纠错信息反馈给上游。

关键词 人工智能