This is a continuation of my previous article on creating a Knowledge Graph in 100 lines of code. In this article I will show you how to use the “chat with your data” paradigm to convert natural language into graph queries, including semantic search over vector embeddings.

There are various approaches to “chatting with your data”, ranging from pure structured query generation to pure Retrieval-Augmented Generation (RAG).

Here I will present a hybrid approach, powered by the Concerto Graph framework, which makes the approach largely declarative and easy to put into practice.

Before we proceed you will need a Neo4j Aura account (free accounts are available), as well as an OpenAI API key.

We start by defining our data model (ontology). The concepts in the data model represent the nodes and relationships within our Knowledge Graph.

For the purposes of this demo we define a simple movie database data model (full code is here):

concept Person extends GraphNode {
  @label("ACTED_IN")
  --> Movie[] actedIn optional
  @label("DIRECTED")
  --> Movie[] directed optional
}

concept Actor extends Person {
}

concept Director extends Person {
}

concept User extends Person {
  o ContactDetails contactDetails
  o AddressBook addressBook
  @label("RATED")
  --> Movie[] ratedMovies optional
}

concept Genre extends GraphNode {
}

concept Movie extends GraphNode {
  o Double[] embedding optional
  @vector_index("embedding", 1536, "COSINE")
  @fulltext_index
  o String summary optional
  @label("IN_GENRE")
  --> Genre[] genres optional
}

We then use the GraphModel API to connect to the database and create indexes for all the concepts in the data model:

  const options: GraphModelOptions = {
    NEO4J_USER: process.env.NEO4J_USER,
    NEO4J_PASS: process.env.NEO4J_PASS,
    NEO4J_URL: process.env.NEO4J_URL,
    logger: console,
    logQueries: false,
    // only compute embeddings if an OpenAI API key is configured
    embeddingFunction: process.env.OPENAI_API_KEY ? getOpenAiEmbedding : undefined
  }
  const graphModel = new GraphModel([MODEL], options);
  await graphModel.connect();
  // reset the database for the demo, then recreate constraints and indexes
  await graphModel.deleteGraph();
  await graphModel.dropIndexes();
  await graphModel.createConstraints();
  await graphModel.createVectorIndexes();
  await graphModel.createFullTextIndexes();

We can then use the mergeNode and mergeRelationships methods to update or insert nodes and relationships in the Knowledge Graph. The properties of the nodes and relationships are validated against the data model, ensuring that only well-structured data can be added to the graph:

await graphModel.mergeNode(transaction, 'Movie', {
  identifier: 'Fear and Loathing in Las Vegas',
  summary: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.'
});
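To give an intuition for what that validation step does, here is a minimal, hypothetical sketch of model-driven property checking. The property names and types mirror the Movie concept declared above, but this is my own illustration, not the framework’s actual implementation:

```typescript
// Hypothetical sketch of validating node properties against a model.
// The movieSpec shape mirrors the Movie concept above; the framework's
// real validation is driven directly by the Concerto model.
type PropertySpec = { type: 'string' | 'number[]'; optional: boolean };

const movieSpec: Record<string, PropertySpec> = {
  identifier: { type: 'string', optional: false },
  summary: { type: 'string', optional: true },
  embedding: { type: 'number[]', optional: true },
};

function validateNode(
  spec: Record<string, PropertySpec>,
  props: Record<string, unknown>
): string[] {
  const errors: string[] = [];
  for (const [name, { type, optional }] of Object.entries(spec)) {
    const value = props[name];
    if (value === undefined) {
      if (!optional) errors.push(`missing required property '${name}'`);
      continue;
    }
    if (type === 'string' && typeof value !== 'string') {
      errors.push(`property '${name}' must be a string`);
    }
    if (type === 'number[]' && !(Array.isArray(value) && value.every(v => typeof v === 'number'))) {
      errors.push(`property '${name}' must be an array of numbers`);
    }
  }
  // reject properties that are not declared in the model at all
  for (const name of Object.keys(props)) {
    if (!(name in spec)) errors.push(`unknown property '${name}'`);
  }
  return errors;
}
```

A payload with a missing identifier or an undeclared property would be rejected before anything is written to the graph.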

Once the graph is populated, you can run a full-text search over the Movie nodes. Movies are automatically indexed for full-text search because their summary property carries the @fulltext_index decorator.

  const fullTextSearch = 'bats';
  console.log(`Full text search for movies with: '${fullTextSearch}'`);
  const fullTextResults = await graphModel.fullTextQuery('Movie', fullTextSearch, 2);

This returns the single result whose summary contains the string ‘bats’:

[
  {
    summary: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.',
    score: 0.4010826349258423,
    identifier: 'Fear and Loathing in Las Vegas'
  }
]

Next, let’s use a conceptual (semantic) search to find 3 movies that are about a given concept, rather than looking for specific text in the summary:

const search = 'working in a boring job and looking for love.';
console.log(`Searching for movies related to: '${search}'`);
const results = await graphModel.similarityQuery('Movie', 'summary', search, 3);
console.log(results);

This returns 3 results, with the movie that is the most similar to the concept “working in a boring job and looking for love” being “Brazil”:

[
  {
    identifier: 'Brazil',
    content: 'The film centres on Sam Lowry, a low-ranking bureaucrat trying to find a woman who appears in his dreams while he is working in a mind-numbing job and living in a small apartment, set in a dystopian world in which there is an over-reliance on poorly maintained (and rather whimsical) machines',
    score: 0.7013247609138489
  },
  {
    identifier: 'Fear and Loathing in Las Vegas',
    content: 'Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.',
    score: 0.5629135966300964
  },
  {
    identifier: 'The Man Who Killed Don Quixote',
    content: `Instead of a literal adaptation, Gilliam's film was about "an old, retired, and slightly kooky nobleman named Alonso Quixano".`,
    score: 0.5493587255477905
  }
]
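The score column is a vector similarity between the query embedding and each movie’s summary embedding; the index above was declared with the "COSINE" metric. For intuition, here is a minimal sketch of cosine similarity (note that Neo4j may normalize the raw value into a different range before reporting it as a score):

```typescript
// Cosine similarity between two embedding vectors: the dot product divided
// by the product of their magnitudes. Returns a value in [-1, 1], where 1
// means the vectors point in the same direction (most similar).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same dimension');
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Because embeddings encode meaning rather than keywords, “Brazil” scores highest even though its summary never mentions a “boring job” or “looking for love” verbatim.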

Next, let’s convert a natural language query into a structured Neo4j Cypher query. We use the query “Which director has directed both Johnny Depp and Jonathan Pryce, but not necessarily in the same movie?”, which is correctly converted to the Cypher query shown below, and the result “Terry Gilliam” is displayed. Bonus points if you got this far and know why the result appears twice!

const chat = 'Which director has directed both Johnny Depp and Jonathan Pryce, but not necessarily in the same movie?';
const chatResult = await graphModel.chatWithData(chat);
    
Chat with data: Which director has directed both Johnny Depp and Jonathan Pryce, but not necessarily in the same movie?
Converted to Cypher query: MATCH (d:Director)-[:DIRECTED]->(m:Movie)<-[:ACTED_IN]-(a1:Actor {identifier: "Johnny Depp"}),
      (d)-[:DIRECTED]->(m2:Movie)<-[:ACTED_IN]-(a2:Actor {identifier: "Jonathan Pryce"})
RETURN d.identifier
[
  {
    "d.identifier": "Terry Gilliam"
  },
  {
    "d.identifier": "Terry Gilliam"
  }
]
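The answer to the bonus question: Cypher returns one row per matching pattern, and the query binds a separate (m, m2) movie pair for each combination of films, so the same director appears once per pair. Adding DISTINCT to the RETURN clause fixes this at the query level; alternatively you can deduplicate the rows client-side, as in this small sketch (the helper name is mine, not part of the framework):

```typescript
// Deduplicate query result rows by a key, keeping the first occurrence
// of each distinct value (client-side equivalent of RETURN DISTINCT).
function distinctBy<T>(rows: T[], key: (row: T) => string): T[] {
  const seen = new Set<string>();
  return rows.filter(row => {
    const k = key(row);
    if (seen.has(k)) return false;
    seen.add(k);
    return true;
  });
}

const rows = [
  { 'd.identifier': 'Terry Gilliam' },
  { 'd.identifier': 'Terry Gilliam' },
];
const unique = distinctBy(rows, r => r['d.identifier']);
// unique contains a single Terry Gilliam row
```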

Finally, our coup de grâce! Let’s create a natural language query that has to exploit both the structured data in the graph, as well as the unstructured text data that we’ve indexed using vector embeddings:

const search = 'working in a boring job and looking for love.';
const chat2 = `Which director has directed a movie that is about the concepts of ${search}? Return a single movie.`;
const chatResult2 = await graphModel.chatWithData(chat2);

Calling tool: get_embeddings
Tool replacing embeddings: MATCH (d:Director)-[:DIRECTED]->(m:Movie)
CALL db.index.vector.queryNodes('movie_summary', 1, <EMBEDDINGS>)
YIELD node AS similar, score
MATCH (similar)<-[:DIRECTED]-(d)
RETURN d.identifier as director, similar.identifier as movie, similar.summary as summary, score limit 1
[
  {
    "director": "Terry Gilliam",
    "movie": "Brazil",
    "summary": "The film centres on Sam Lowry, a low-ranking bureaucrat trying to find a woman who appears in his dreams while he is working in a mind-numbing job and living in a small apartment, set in a dystopian world in which there is an over-reliance on poorly maintained (and rather whimsical) machines",
    "score": 0.7065032720565796
  }
]

In the output trace we can see four steps:

  1. OpenAI determining, via a tool configuration, that to satisfy this query it needs vector embeddings for the string ‘working in a boring job and looking for love.’
  2. Generation of a Cypher query with a placeholder set of embeddings <EMBEDDINGS>. If you are curious to see how this works you can find the prompt we use for OpenAI here. It combines the Concerto domain model with the natural language query and some information about index naming to create (mostly) valid Cypher queries.
  3. Generation of the embeddings using the OpenAI embedding model.
  4. Execution of the Cypher query, which correctly returns Terry Gilliam as the most likely director of a movie about the concepts of ‘working in a boring job and looking for love.’
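Steps 2 to 4 hinge on a simple mechanic: the model emits Cypher containing a literal <EMBEDDINGS> placeholder, and the framework splices in the real vector before execution. Here is a hypothetical sketch of that substitution; the framework’s actual mechanism may differ, and in production you would typically pass the vector as a query parameter rather than splicing text:

```typescript
// Replace the <EMBEDDINGS> placeholder in a generated Cypher query with
// the actual embedding vector, serialized as a Cypher list literal.
// Illustrative sketch only, not the framework's real implementation.
function spliceEmbeddings(cypher: string, embedding: number[]): string {
  return cypher.replace('<EMBEDDINGS>', JSON.stringify(embedding));
}

const generated = `CALL db.index.vector.queryNodes('movie_summary', 1, <EMBEDDINGS>)`;
const query = spliceEmbeddings(generated, [0.1, 0.2, 0.3]);
// query: CALL db.index.vector.queryNodes('movie_summary', 1, [0.1,0.2,0.3])
```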

Voilà! We are chatting with our data, and we’ve not written any application-specific code: we’ve just defined our data model and called some framework APIs. For a full-fledged movie database example please refer to my movie-graph demo.

To give you a hint of how powerful this can be, take a look at this transcript:

Which actor famously starred in a film conceptually about journalism, hitchhiking and drugs in Las Vegas?
Calling tool: get_embeddings
Converting query with embeddings to Cypher...
Tool replacing embeddings: MATCH (m:Movie)
    CALL db.index.vector.queryNodes('movie_summary', 3, <EMBEDDINGS> )
    YIELD node AS similar, score
    MATCH (similar)<-[:ACTED_IN]-(a:Actor)
    RETURN a.identifier as actor_identifier, similar.identifier as movie_identifier, similar.summary as movie_summary, score limit 1 with: journalism, hitchhiking and drugs in Las Vegas
[
  {
    "actor_identifier": "Johnny Depp",
    "movie_identifier": "Fear and Loathing in Las Vegas",
    "movie_summary": "Duke, under the influence of mescaline, complains of a swarm of giant bats, and inventories their drug stash. They pick up a young hitchhiker and explain their mission: Duke has been assigned by a magazine to cover the Mint 400 motorcycle race in Las Vegas. They bought excessive drugs for the trip, and rented a red Chevrolet Impala convertible.",
    "score": 0.7222381234169006
  }
]

Have fun!