Vectorizing domain data to help large language models talk the language of energy

In the realm of applying large language models (LLMs) to domain-specific questions, the challenge of the knowledge cutoff looms prominently. This issue stems from the fact that LLMs are primarily trained on vast and diverse general corpora that often lack depth in specific domains and include content only up to a fixed training checkpoint. Consequently, when faced with domain-specific queries or tasks, LLMs may struggle to provide accurate or relevant responses due to their limited exposure to domain-specific nuances, terminology, contexts, and recent content. For example, the term "Christmas tree" evokes different meanings in different contexts: in general usage, it is a decorated tree for Christmas celebrations, but in the energy industry, it refers to a wellhead assembly used in oil and gas production.

This limitation matters acutely in the energy industry, where precise answers can profoundly impact decision-making and outcomes, such as optimizing energy production, ensuring operational efficiency, or meeting regulatory requirements. In these critical scenarios, relying solely on general LLMs without domain-specific knowledge is wholly inadequate.

One way to mitigate this issue is to vectorize our domain data, enabling the integration of industry-specific knowledge and context into LLMs. This helps the LLMs grasp the intricacies of the energy domain and better contextualize domain-specific terms, giving more accurate and relevant responses with up-to-date knowledge. By incorporating recent and proprietary data through vectorization, LLMs can stay current with industry developments and provide accurate insights and solutions tailored to specific domain challenges.

Vector databases are databases designed specifically for storing, indexing, and searching vectors, i.e., numerical representations of data such as text, images, or audio. Unlike traditional databases that store scalar values, vector databases handle data in vector form, which is well suited to large volumes of unstructured data. Each data point is represented in a multi-dimensional space, enabling the encapsulation of complex information, from simple numerical datasets to high-dimensional data like text, images, or videos. Vector databases are known for efficient similarity search, advanced indexing techniques, and distributed architectures that allow them to handle vast amounts of diverse data types (Lim, 2023).
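
To make the idea concrete, the minimal sketch below embeds two "Christmas tree" sentences and ranks them against an energy-domain query by cosine similarity. It assumes the sentence-transformers package; the model name is an illustrative choice, not necessarily the one used in our study.

```python
# Minimal sketch: embed short texts and rank them against a query by
# cosine similarity. Assumes the sentence-transformers package; the
# model name is an illustrative choice, not necessarily the study's.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "The christmas tree valve assembly was pressure tested at the wellhead.",
    "We decorated the christmas tree with lights for the holidays.",
]
query = "oil and gas wellhead production equipment"

corpus_vecs = model.encode(corpus)      # one dense vector per text
query_vec = model.encode([query])[0]

def cosine(a, b):
    # Higher cosine similarity means the texts are semantically closer.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for text, vec in zip(corpus, corpus_vecs):
    print(f"{cosine(query_vec, vec):.3f}  {text}")
```

The wellhead-related sentence should score noticeably higher against the energy-domain query, which is exactly the behavior a vector database's similarity search exploits.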

 

Figure 1: Workflow to create a vector database.

 

There are several technical considerations in building an effective vector database: the type of documents targeted for vectorization, the strategy for chunking documents prior to storage, the optimal chunk size, and the embedding model used to convert these chunks into embeddings, i.e., numerical representations (Figure 1). One study, focused on drilling event identification, has demonstrated the significant impact of these factors on model performance. By carefully considering the interplay between these elements, we can achieve notable improvements in model accuracy and efficiency (Read more).
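
As a rough illustration of the Figure 1 workflow, the sketch below chunks a handful of remarks, embeds them, and indexes them for similarity search. It assumes LangChain as the orchestration framework (import paths vary across LangChain versions) and an illustrative Hugging Face model; neither is confirmed by the study.

```python
# Sketch of the Figure 1 workflow, assuming LangChain as the framework
# (import paths vary across LangChain versions) and an illustrative
# Hugging Face model. Two toy remarks stand in for the DDR corpus.
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

ddr_remarks = [
    "Observed partial losses while drilling 12-1/4in section at 3,450 m.",
    "Stuck pipe event during tripping out; worked string free after 2 hrs.",
]

# 1. Chunk the documents (mirroring the study: 512 chars, 20 overlap).
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=20)
chunks = splitter.create_documents(ddr_remarks)

# 2. Embed each chunk into a numerical vector.
embeddings = HuggingFaceEmbeddings(
    model_name="sentence-transformers/all-MiniLM-L6-v2")

# 3. Index the embedded chunks in a FAISS vector store.
vector_db = FAISS.from_documents(chunks, embeddings)

# 4. Retrieve the stored chunks most similar to a query.
hits = vector_db.similarity_search("mud losses while drilling", k=2)
```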

Our findings underscore the value of developing task-specific vector databases. For instance, in the context of drilling event identification (Read more), our database is enriched with sample drilling remarks and historical daily drilling reports (DDRs), thereby enhancing the LLM's domain knowledge of drilling. However, we caution against indiscriminately adding unrelated documents such as oil and gas glossaries or SPE glossaries, as this introduces noise and biases that can detrimentally impact model performance.

We studied the impact of several factors in building an efficient, domain-relevant, task-specific vector database. The effectiveness of the vector database was evaluated based on retrieval-augmented generation (RAG) accuracy. The factors are listed below; a sketch of how such a parameter sweep might be organized follows the list:

  1. The types and amount of data preloaded into the vector database.
  2. The document splitting strategy and the chunk size.
  3. The embedding model.
  4. The vector database type.
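
A hypothetical sketch of how such a sweep could be organized is shown below; build_vector_db and evaluate_rag are placeholder stubs standing in for the study's actual pipeline and benchmark, not real functions from it.

```python
# Hypothetical sketch of the factor sweep. build_vector_db and
# evaluate_rag are placeholder stubs standing in for the real pipeline
# and RAG-accuracy benchmark; they are not functions from the study.
from itertools import product

def build_vector_db(**config):
    # Placeholder: would chunk, embed, and index the input documents.
    return config

def evaluate_rag(db):
    # Placeholder: would return RAG accuracy on the benchmark set.
    return 0.0

splitting_strategies = ["character", "sentence", "recursive"]
chunk_sizes = [256, 512, 1024]
embedding_models = ["huggingface", "openai"]
db_types = ["faiss", "chroma"]

results = []
for strategy, size, embedder, db_type in product(
        splitting_strategies, chunk_sizes, embedding_models, db_types):
    db = build_vector_db(strategy=strategy, chunk_size=size,
                         embedding=embedder, db_type=db_type)
    results.append((strategy, size, embedder, db_type, evaluate_rag(db)))

best = max(results, key=lambda r: r[-1])
```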

To validate the results, it is crucial to understand the domain implications and to design a representative benchmark, ensuring our business users get the right answers. The results in the tables below cover inference on 13 drilling events from 342 drilling remarks retrieved from ~180 DDRs for 24 wells in a single field. The event classes are imbalanced, so we considered both weighted and unweighted scores during evaluation: weighted scores emphasize the major classes, while unweighted scores treat all classes equally. The metric scores derived in the following sections are compared against labels provided by the drilling engineer through manual reading. See also the article Extracting Drilling Risks from Daily Drilling Reports using Generative AI.
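
For example, with scikit-learn the two views correspond to the weighted and macro averaging modes; the labels below are toy values, not the study's data.

```python
# Weighted vs. unweighted (macro) scoring on an imbalanced label set,
# using scikit-learn. The labels below are toy values, not study data.
from sklearn.metrics import f1_score

y_true = ["losses", "losses", "losses", "losses", "stuck_pipe", "kick"]
y_pred = ["losses", "losses", "losses", "stuck_pipe", "stuck_pipe", "losses"]

# Weighted: each class's F1 is weighted by its support, so the score
# is dominated by the major class ("losses").
print(f1_score(y_true, y_pred, average="weighted"))

# Macro (unweighted): every class counts equally, exposing errors on
# minor classes such as "kick".
print(f1_score(y_true, y_pred, average="macro"))
```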

 

Table 1: Results comparing different input documents used to build the vector database. Metrics below use chunk size 512, overlap size 20, Hugging Face embeddings, and the recursive chunking strategy.

 

Chunking is the process of breaking large pieces of text from the input documents into smaller segments before storing them in the vector database. The effectiveness of vector databases relies on the chosen chunking strategy, as it contributes to optimizing content relevance for embedding into LLM inputs. In the study, we compared the results of segmenting the text by characters, by sentence, or by a set of predefined separators. The results below show that the impact of the chunking strategy may not be significant, but an effective splitting method gives better model performance in terms of accuracy and recall, especially for the major drilling event classes.
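
A sketch of the three strategies is below, assuming LangChain's text splitters (the sentence-based splitter additionally requires NLTK and its punkt data); the corpus here is a toy stand-in for the DDR remarks.

```python
# The three splitting strategies compared in the study, assuming
# LangChain's text splitters (NLTKTextSplitter also needs the nltk
# package and its 'punkt' tokenizer data). Toy corpus stands in for DDRs.
from langchain.text_splitter import (
    CharacterTextSplitter,           # split on a single separator
    NLTKTextSplitter,                # split on sentence boundaries
    RecursiveCharacterTextSplitter,  # split on predefined separators
)

text = ("Observed partial losses while drilling 12-1/4in section. "
        "Pumped LCM pill and losses cured. Resumed drilling ahead.\n\n") * 50

splitters = {
    "by characters": CharacterTextSplitter(chunk_size=512, chunk_overlap=20),
    "by sentence": NLTKTextSplitter(chunk_size=512, chunk_overlap=20),
    "by separators": RecursiveCharacterTextSplitter(
        chunk_size=512, chunk_overlap=20,
        separators=["\n\n", "\n", ". ", " "]),
}

for name, splitter in splitters.items():
    chunks = splitter.split_text(text)
    print(f"{name}: {len(chunks)} chunks")
```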

 

Table 2: Results comparing different chunking strategies. Metrics below use chunk size 512, overlap size 20, OpenAI embeddings, and 22,000 DDR remarks as input.

 

The other key factor is the chunk size, as it affects the vector representations. Smaller chunks may be more concise and focused, but we may miss some context within a sentence; larger chunks may provide better context but introduce noise. Different chunk sizes need to be tested for different task-specific input documents.
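
A quick way to see the trade-off is to count the chunks produced at different sizes; the snippet below assumes LangChain's recursive splitter and a toy corpus.

```python
# Counting chunks at different sizes makes the trade-off visible:
# smaller sizes yield many focused fragments, larger sizes fewer but
# broader ones. Assumes LangChain's recursive splitter; toy corpus.
from langchain.text_splitter import RecursiveCharacterTextSplitter

text = ("Observed partial mud losses at 3,450 m. Pumped LCM pill. "
        "Losses cured and drilling resumed.\n\n") * 50

for chunk_size in (128, 256, 512, 1024):
    splitter = RecursiveCharacterTextSplitter(
        chunk_size=chunk_size, chunk_overlap=20)
    chunks = splitter.split_text(text)
    avg = sum(map(len, chunks)) // len(chunks)
    print(f"chunk_size={chunk_size}: {len(chunks)} chunks, avg {avg} chars")
```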

 

Table 3: Results comparing different chunk sizes. Metrics below use 22,000 DDR remarks as input, Hugging Face embeddings, and the recursive chunking strategy.


 

Another crucial factor is the embedding model used to convert our data into vector representations. In the study, we used a cost-free embedding model from Hugging Face to generate the numerical representations; it performed better than the OpenAI embedding model.
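
Behind a common interface, swapping models is a one-line change. The sketch below assumes LangChain's embedding wrappers; the Hugging Face model name is illustrative, and OpenAIEmbeddings requires an OPENAI_API_KEY.

```python
# Swapping embedding models behind the same interface, assuming
# LangChain's wrappers. The Hugging Face model name is illustrative,
# and OpenAIEmbeddings needs the OPENAI_API_KEY environment variable.
from langchain.embeddings import HuggingFaceEmbeddings, OpenAIEmbeddings

hf = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
oa = OpenAIEmbeddings()

remark = "Tight hole observed while reaming; increased mud weight."

hf_vec = hf.embed_query(remark)  # cost-free, runs locally
oa_vec = oa.embed_query(remark)  # paid API call

# The models produce vectors of different dimensionality, so an index
# built with one embedding model cannot be queried with the other.
print(len(hf_vec), len(oa_vec))
```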

 

Table 4: Results comparing different embedding models. Metrics below use 22,000 DDR remarks as input, chunk size 512, overlap size 20, and the recursive chunking strategy.

 

We also compared different open-source vector databases and found the difference in model performance to be trivial. However, in terms of memory savings, Facebook AI Similarity Search (FAISS) consumed 70% less memory than the Chroma database.
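
Through a common vector-store interface, only the backend changes while the queries stay identical, which is what makes this comparison cheap to run. A minimal sketch, again assuming LangChain wrappers (Chroma additionally requires the chromadb package):

```python
# Same index built in FAISS and Chroma through a common vector-store
# interface (assuming LangChain wrappers; Chroma also needs the
# chromadb package). Only the backend changes; queries stay identical.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS, Chroma

chunks = ["Partial losses observed at 3,450 m.",
          "Stuck pipe while tripping out of hole."]
emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")

faiss_db = FAISS.from_texts(chunks, emb)
chroma_db = Chroma.from_texts(chunks, emb)

query = "mud losses event"
print(faiss_db.similarity_search(query, k=1)[0].page_content)
print(chroma_db.similarity_search(query, k=1)[0].page_content)
```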

Considering both memory consumption and effectiveness in vectorizing our task-specific domain knowledge base, we combined all the best parameters from our tests for better vectorization: 22,000 DDR remarks as input, the recursive chunking strategy, a chunk size of 512 characters with an overlap of 20 characters, and Hugging Face embedding models to build the FAISS vector database. Although the precision for the minor drilling event classes is reduced, the combined configuration gives better confidence on the major drilling event classes overall and increases the recall of the minor drilling event classes by 6%.
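
To show how the finished database feeds zero-shot RAG, the sketch below retrieves the chunks nearest a new remark and places them in a classification prompt. The two seed remarks, the prompt wording, and the commented-out llm() call are placeholders, not the study's actual data, prompt, or model.

```python
# Zero-shot RAG sketch: retrieve nearest chunks for a new remark and
# place them in a classification prompt. The seed remarks, prompt
# wording, and commented-out llm() call are placeholders, not the
# study's actual data, prompt, or model.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.vectorstores import FAISS

emb = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vector_db = FAISS.from_texts(
    ["Partial losses cured with LCM pill at 3,450 m.",
     "Stuck pipe while tripping out; worked string free after 2 hrs."],
    emb)  # stands in for the full 22,000-remark FAISS database

retriever = vector_db.as_retriever(search_kwargs={"k": 4})

remark = "Lost 40 bbl of mud while drilling 8-1/2in section."
context = "\n".join(d.page_content
                    for d in retriever.get_relevant_documents(remark))

prompt = ("Using the drilling context below, classify the drilling event "
          "described in the remark.\n\n"
          f"Context:\n{context}\n\nRemark: {remark}\nEvent:")
# answer = llm(prompt)  # placeholder for the actual LLM call
```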

 

Table 5: Comparing a one-shot prompt vs. zero-shot RAG.

 

In conclusion, there is still room for improvement in the precision for infrequently encountered classes of drilling risk. The vector database we built is task-specific, which is important in order to avoid introducing noise during the retrieval process. For tasks involving similarity retrieval, a vector database is normally a good fit. For retrieval over more complex, interconnected, and semantically rich data, a knowledge graph would be a better fit, as it preserves structured, domain-specific knowledge. We may also consider hybrid approaches that leverage the respective strengths of both for better knowledge retrieval, enhancing the LLM for more confident output with reduced hallucination. Understanding how all the design parameters impact accuracy, and in turn the possible deployment options, and how to integrate them into real domain workflows is key for successful adoption; without domain understanding, representative datasets, and rigorous evaluation, this becomes very difficult. Today, the performance of this system using the best-performing vectorization approach described in this article is already enough to generate substantial business value and reduce risk.

 

View the full Insights Series

 Patricia Cejas Manceron

Lee Ming Xiang

Domain Data Scientist - Subsurface