Graph Query Language - the new kid on the block!

Alex Miłowski

Abstract

There is a new kid on the block and they may have stolen your lunch! Or maybe you’ll make a new friend? In April of this year, ISO/IEC JTC 1/SC 32, the same group that publishes and maintains SQL, published a new query language: ISO/IEC 39075:2024 GQL. This new standard defines the property graph data model and syntax for querying and manipulating graphs. What is a property graph? What does a GQL query look like? Where might you find and use GQL? We’ll explore a short history, the basic aspects of this new standard, show some examples, and consider future applications.

Historical context

In 2019, the W3C held a workshop called Web Standardization for Graph Data [1] with the intent of bringing together various parties working on different aspects of graph data and databases on and off the Web. Later that same year, ISO established GQL as an official ISO project to be accomplished by ISO/IEC JTC 1/SC 32 [3] committee that is also the same committee that publishes and maintains the SQL query language. While the attempt at unification by the W3C may not have succeeded, the need and want to address standardization of database query languages for graphs had already been well established.

Graphs have long been used in mathematics and computer science. Alongside, database systems have evolved for storing graph data (e.g., some as early as 1973, see IDMS [7]). Many of these earlier systems used traversal as a mechanism for retrieving information in a graph. That is, instead of a query that specifies properties of matching data, information is retrieved by using a starting node in the graph and traversing relationships between nodes. Some more modern systems have proprietary query languages while others have both traversal and query capabilities.

A graph database [4] has the distinguishing characteristic that it uses graph structures to store information where conceptually we have:

nodes that represent entities,
edges that are relationships between entities,
and information is associated with nodes and edges in the graph.

From 2019 through to the present, the participants of ISO/IEC JTC 1/SC 32 have been working on standardizing a query language for graphs that is specifically focused on the property graph model where nodes and edges may have labels and property sets. Regardless of the motivation to use property graphs, a network model with node and edge attributes is very common and valuable to data practitioners. As such, it is a very good place to consider standardization and the result was the ISO/IEC 39075:2024 GQL standard [2].

Property graphs

A property graph [5] is a graph G(N,E) with specific properties that can be informally described as:

N is a set of nodes,
E is a set of edges associated with two nodes of G,
Every node may have a set of label names,
Every edge may have a set of label names,
Nodes and edges may each have a property set that is a mapping from name to value.

Such a graph must also satisfy these additional constraints:

An edge may be a self loop with the same node used in the association.
If an edge is directed, one node must be the source and one node must be the destination.
Label names may not repeat for a specific node or edge; they may repeat across nodes and edges.
A property set is a mapping of a unique name to a value. That is, there is a single value for a unique name for a specific node or edge.

A good way to conceptualize a property graph is with an Entity-Relationship Model [9]. For example, a very simple graph of movies, series, cast members, and genres might start with a diagram like:

In the above ERM model, properties have been only associated on the Person, Movie, Genre, and Series entities. In the full ERM model, we can add properties to relationships as well. Also, in the ERM model, the type of the entity or relationship is a label in the property graph model. The only real difference with property graphs is that the label is just a set of names that does not necessarily imply a type or requirement for certain properties. Finally, you may also have multiple labels for nodes and edges (i.e., entities and relations in ERM).

GQL only considers property graphs as the underlying graph model. All queries, types, schemas, and results are built from queries surrounding property graphs alone. While you may be able to map other graph models into a property graph, the GQL language itself has no additional support for doing so.

What does a GQL query look like?

GQL shares a syntactic heritage with the openCypher query language [8] where the central concept is the graph pattern that describes a node or path in the graph. Within the query, identifiers can be bound to various aspects (e.g., a node, edge, or path) and results can be computed. A result is typically a table of data that may consist of simple values (e.g. a string or number) or complex objects like a node, edge, or path.

What transpires between a client that requested a query execution and the database engine returning the resulting data return is generally implementation defined. At a more abstract layer, results are returned via the binding table as part of the GQL agent’s execution outcome. While the way return values are transported is not part of the scope of the standard, it is easy to conceptualize this result as a table of data, as the name implies, but where the columns may be complex GQL objects such as nodes or edges.

While a simplification of the grammar, a query generally consists of a primitive query statements (§14.3 [2]) that are joined together much like a pipeline where the results of the previous statement are modified by the application of the next statement. At the end of this pipeline is often a return statement that formulates a result. Without the return statement, the response may be empty.

At the center of many expressions is a graph pattern (§16.3 [2]) that describes a set of path patterns that can be used to identify portions of the graph. These patterns are used by the containing statement to identify the target nodes and edges in the graph and are subsequently used to change the binding table to affect the query outcome. Patterns affect the binding table by assigning variables to portions of an expression (i.e., paths, nodes, or edges).

A graph pattern can express:

a node: (:Movie)
an edge: -[:cast]->
a path: (:Movie)-[:cast]->(:Person)

More complicated expressions can be built by combining these expressions. In the abbreviated syntax shown in the above examples, a label name is prefixed with a colon and qualifies to the kind of node or edge that matches. More than one label can be used with various operators to represent unions, disjunctions, conjunctions, etc.

The edge expression has various syntactic elements that indicate direction in addition to the label:

-[:cast]-> is an edge pointing right
<-[:cast]- is an edge pointing left
~[:cast]~ is an undirected
~[:cast]~> is an undirected or edge pointing right
<~[:cast]~ is an undirected or edge pointing left
<-[:cast]-> is an edge in any direction
-[:cast]- is any edge

These edge patterns can also be shortened to remove the brackets ([]) and label for the edge (e.g., ->, <-, ~, ~>, <~, <->, -, respectively).

Using the graph of movies described in the previous section, consider a simple expression using a match statement (§14.4 [2]) to find the cast of a popular movie:

MATCH (:Movie {title: 'Star Wars: A New Hope'})-[:cast]->(a:Person)
RETURN a.name, a.birth, a.death

The query matches a specific movie by title where the title property of a Movie labled node must match Star Wars: A New Hope. The pattern follows a directed edge that must be labeled cast to a node that must be labeled Person. In the expression (a:Person), a variable is bound to the matching nodes by preceding the colon with a variable identifier. The effect of the statement is the binding table is modified to include a column with a name of a whose values are the matching Person labeled nodes.

The following statement starting with the RETURN keyword is a primary return statement (§14.19 [2]) that contains a return statement (§14.11 [2]) that modifies the binding table to include three columns named name, birth, and death. This binding table is the working table outcome for the return statement is used as the execution outcome’s result.

A graph pattern can also represent more complex relationships. For example, the co-stars for particular individual might be retrieved as:

MATCH DIFFERENT EDGES 
  (:Person {name:'Mark Hamill'})<-[c1:cast]-(m:Movie)
    -[c2:cast]->(a:Person)
RETURN a.name

In this query, the DIFFERENT EDGES clause ensures that the edge c1 and c2 are different to ensure that Mark Hamill is not amongst the co-stars.

A brief tour of statements

There are five primitive query statements (match, filter, let, for, order by/pagination) and the select statement (§14.3 [2]) that form the basis of a linear query statement without considering the additional complexity of focused and nesting queries. The select statement is a single statement whereas the remaining five primitive query statements can be chained together for a combined effect.

In the prior section, the match statement was introduced. This statement can be used in combination with other statements to produce a specific result. For example, the binding table can be filtered with a filter statement (§14.6 [2]) where we want to list cast who appeared in more than one movie of a series:

MATCH (a:Person)<-[:cast]-(m:Movie)
        -[:partof]->(:Series {name:'Star Wars'})
FILTER WHERE count(m) > 1
RETURN a.name

The let statement allows us to bind variables to be used in queries:

MATCH (:Person {name:'Mark Hamill'})
         <-[:cast]-(m:Movie)-[:partof]->(:Series {name:'Star Wars'})
LET numberOfMovies = count(m)
MATCH (a:Person)<-[:cast]-(om:Movie)
        -[:partof]->(:Series {name:'Star Wars'})
FILTER WHERE count(om) > numberOfMovies
RETURN a.name

The for statement allows us to remove nesting. For example, we may simplify processing by returning the movie title along with the series genre as a single row of pairs of names:

MATCH (m:Movie)->(s:Series {name:'Star Wars'})
OPTIONAL MATCH (s)-[:genre]-(g)
FOR name IN g
RETURN m.title, name

Finally, we can order and limit results:

MATCH (a:Person)<-[:cast]-(m:Movie)
        -[:partof]->(:Series {name:'Star Wars'})
ORDER BY a.name ASCENDING
LIMIT 100
RETURN a.name

In contrast, the select statement is an example of a syntactic transformation in GQL. While there is some complexity in the transformation, the general form is the fields are selected from a matching graph query and the criteria and results are specified by direct syntactic transformations into other statements.

For example, consider the following select statement:

SELECT a.name,m.title
FROM CURRENT_GRAPH
  MATCH DIFFERENT EDGES 
    (:Person {name:'Mark Hamill'})<-[c1:cast]-(:Movie)
      -[c2:cast]->(a:Person)<-[c3:cast]-(m:Movie)
WHERE 
  NOT EXISTS { MATCH (m)-[:partof]->(:Series {name:'Star Wars'}) }

This select statement transforms into a query scoped by an inline procedure call (§15.2 [2]) that is followed by a filter statement and a primitive result statement. While the user of GQL does not need to be concerned with such transformation, there are many of these syntactic transformations that an implementor needs to be able to carefully construct.

CALL () {
  USE CURRENT_GRAPH
  MATCH DIFFERENT EDGES 
    (:Person {name:'Mark Hamill'})<-[c1:cast]-(:Movie)
      -[c2:cast]->(a:Person)<-[c3:cast]-(m:Movie)
  RETURN c1 AS c1, c2 AS c2, c3 AS c3, a AS a, m AS m
}
FILTER WHERE 
  NOT EXISTS { MATCH (m)-[:partof]->(:Series {name:'Star Wars'}) }
RETURN a.name, m.title

CRUD in GQL

Like most database query languages, GQL has statements for inserting nodes and edges, setting properties and labels, removing properties and labels, and deleting nodes and edges. An insert statement is a simple enumeration of graph patterns where variables can be assigned to elements of prior patterns for use in subsequent patterns.

For example, the insert statement (§13.2 [2]) can be used to add a movie and cast member into the graph. The following example results in a new Movie node, and edge labeled partof between that movie node and a new Series node, and a new Person node with a cast labeled edge pointing to it from the movie node:

INSERT 
(m:Movie {
   title: 'Star Wars: Episode IV - A New Hope', 
   released: '1977-05-25'}),
(m)-[:partof]->(:Series {name:'Star Wars'}),
(:Person {name:'Mark Hamill', birth: '1951-09-25'})<-[:cast]-(m)

A more nuanced insertion might use the result of a query to perform the insertion where only the new Person node and cast labeled edge are created:

MATCH (m:Movie {title: 'Star Wars: Episode IV - A New Hope'})
        -[:partof]->(:Series {name:'Star Wars'})
INSERT (:Person {
          name:'Mark Hamill', 
          birth: '1951-09-25'
        })<-[:cast]-(m)

A matched node or edge can have properties and labels set by a set statement (§13.3 [2]). For example, the birth property can be set on a matched Person labeled node:

MATCH (p:Person {name:'Mark Hamill'})
SET p.birth = '1951-09-25'

Similarly, a remove statement (§13.4 [2]) can be used to remove properties or labels. For example, we can ensure a matching Person node does not have a death property:

MATCH (p:Person {name:'Mark Hamill'})
REMOVE p.death

Finally, the delete statement (§13.5 [2]) can be used to remove nodes and edges from the graph entirely. For example, we can ensure a particular person is not a cast member by matching the Person node and using the delete statement on the bound variable for the edge:

MATCH (:Person (name: 'Kevin Bacon'))<-[c:cast]-(:Movie)
        -[:partof]->(:Series {name:'Star Wars'})
DELETE c

Curiously, there appears to be no counterpart to the MERGE statement from openCypher. The MERGE statement in cypher allows node and edges to be optionally matched by property values. If present, the existing nodes and edges would be used to construct the missing graph patterns. When not present, they are created. Without the semantics of this kind of merge statement, a client application must test for the existence of various nodes and edges to perform updates; a possibly expensive operation to perform outside the database engine.

Possible applications

The obvious candidates for adoption are current databases that implement property graphs and the openCypher query language. These products are positioned close enough to GQL that a simple syntax transformation may support simple GQL queries. Further, a more complete version of GQL may be accomplished by a more complicated transformation layer and that may be able to be layered without too much effort.

A complete GQL implementation may be farther afield. This particularly depends on the features of the query language required by certain applications. There are 228 optional features in the GQL standard. Whereas, particular application areas may require a common subset of these to be implemented for successful use.

In the near term, it is fair to assume that areas of applications will be the same areas currently addressed by property graph database technologies. Longer term, you may see an expansion of possibilities as relational databases add graph query capabilities (e.g., Apache AGE for Postgres [6]). Such things may bridge the gap for scaling database deployments.

GQL is unlikely to change the scope of applications where openCypher has currently been used to access property graph databases. Instead, it just helps legitimize the offering of these database engines by giving them a standardized query language. By doing so, developer and user tools can be separated from database engines and this may help adoption.

Conclusions

As a new standard, it remains to be seen what the pace and breadth of adoption will be as database vendors and users contemplate graph databases. There are a core set of likely vendors who are already using openCypher where the conversion to GQL as an alternate syntax is probable. Whether there are conformance levels that match the subsets of GQL that are useful to them is worth further research.

The GQL standard contains a variety of alternative syntaxes. For reasons not stated in the standard itself, there are keyword synonyms like NODE versus VERTEX, EDGE vs RELATION, etc. that seem superfluous and likely to lead to user confusion or interoperability issues. Similarly, there are various keywords that are simply optional without affecting the outcome (e.g., PROPERTY GRAPH vs GRAPH) and that seems also unneeded.

The standard itself is devoid of examples. It is not intended to be a primer. Yet, the semantics described in the General Rules of various statements feel under specified at times. The combination of these two will likely lead to the need for clarifications or revisions from the editors.

Within GQL there is also a schema language for describing graphs nodes, edges, and their labels and properties. While a good start at describing graph structures, it will need more abilities in future revisions. This schema language should be viewed as something for leaf types that the database engine uses for optimizations and basic type checking.

These concerns and opinions lead to one general observations: there will be revisions to the GQL standard. At a minimum, this revision will contain corrections or clarifications. Further revisions are unlikely to expand the scope of an implementor’s burden without sufficient justification. That is, there is plenty of room for product growth just making efficient scale-out databases execute GQL queries with reasonable metrics.

Meanwhile, let’s welcome the new kid on the block and see what they can do. Graph databases are not new but a common query language is new. As a center point for innovation, GQL could be the common cause needed to launch graph database technology into a new generation of applications of large-scale graphs.

References

[1] W3C Workshop on Web Standardization for Graph Data - Creating Bridges: RDF, Property Graph and SQL. 2019. Retrieved from https://www.w3.org/Data/events/data-ws-2019/

[2] ISO/IEC 39075:2024 Information technology - Database languages - GQL. 2024.

[3] Critical milestone for ISO graph query standard GQL. Retrieved from https://www.gqlstandards.org/gql-blogs/critical-milestone-for-iso-graph-query-standard-gql

[4] Graph Database. Retrieved from https://en.wikipedia.org/wiki/Graph_database

[5] Property graphs. Retrieved from https://en.wikipedia.org/wiki/Property_graphs

[6] Apache AGE : Graph Database for Postgres SQL. Retrieved from https://age.apache.org

[7] Russell, Andrew L. 2011. Oral-History: Charles Bachman. IEEE Oral History Network (April 2011).

[8] Marton, J., Szárnyas, G., Varró, D. 2017. Formalising openCypher Graph Queries in Relational Algebra. Advances in Databases and Information Systems 10509. doi:https://doi.org/10.1007/978-3-319-66917-5_13

[9] Chen, Peter Pin-Shan. 1976. The Entity-Relationship Model – Toward a Unified View of Data. ACM Transactions on Database Systems 1 (1976). doi:https://doi.org/10.1145/320434.320440

BalisageThe Markup Conference

Balisage Paper: Graph Query Language - the new kid on the block!

Alex Miłowski

`<alex@milowski.com>`

Table of Contents