Generative AI Workflows
Architectural Approaches and Use Cases
BY KEVIN PETRIE
OCTOBER 2023
RESEARCH SPONSORED BY MATILLION
Executive Summary
Generative artificial intelligence (GenAI) offers compelling opportunities for companies to boost productivity
and gain competitive advantage. To realize this opportunity, companies need to implement workflows
that span text sources, data pipelines, vector databases, language models, and applications—all of which
stretch the capabilities of modern data teams. They should design these workflows to be automated,
overseen by humans, modular, open, governed, and efficient.
Workflows driven by language models (LMs) can help companies increase productivity, efficiency, and
revenue as they support use cases such as customer service, document processing, and medical care. They
represent the next wave of innovation, coming on the heels of employee adoption of LMs such as ChatGPT
and LM-enabled tools such as GitHub Copilot. Data and AI leaders can increase the odds of success with
their generative AI initiatives by following three guiding principles:
> Pain signals opportunity. If your company feels pain with certain manual processes—and most do—
turn this weakness into strength. Find and capitalize on opportunities to accelerate operations, delight
customers, and steal customers from your rivals.
> Pick your battles. Most companies cannot hope to build an LM with better language processing
capabilities than ChatGPT, Bard, or BLOOM. A more viable option is to fine-tune or prompt existing LMs
with your own domain-specific data.
> Govern. Govern. Govern. Generative AI, like other types of AI, can exacerbate long-standing risks to data
quality, privacy, and handling of intellectual property. Design your data pipelines to minimize these risks
while preparing and governing text for consumption by LMs.
High Hopes
Generative AI (GenAI) offers compelling opportunities for companies to boost productivity and gain
competitive advantages. This powerful technology can converse with customers, process documents in
real time, and extract business value from the text that flows through their organizations. It goes beyond
automation to help us understand and enrich business experiences in new ways.
To realize this opportunity, companies need to implement workflows that span text sources, data
pipelines, vector databases, language models, and applications—all of which stretch the capabilities of
modern data teams. This report defines elements of the workflows and data pipelines that feed generative
AI initiatives. It also explores key design requirements and profiles several emerging GenAI use cases.
Generative AI
We’ve come a long way since Alan Turing first asked whether machines could think in his
groundbreaking paper “Computing Machinery and Intelligence” more than 70 years ago, helping
launch the field of artificial intelligence. After a number of false starts, the field of AI has blossomed in
recent years thanks to advances in data collection, compute power, and algorithms. These advances
especially contribute to GenAI, a type of AI that generates digital content such as text, images, or audio
after training itself on a corpus of existing content. GenAI helps humans process and generate content at
an unprecedented speed and scale.
Language Model
The most broadly applicable form of GenAI centers on a language model, which is a type of neural network
whose interconnected nodes share inputs and outputs as they interpret, summarize, and generate text.
To do this, the LM measures the interrelationships of words and character strings—i.e., tokens—in written
content. After studying how millions of tokens relate to one another, it learns to create its own strings of
tokens that become logical sentences for humans to read. A trained LM produces conversational answers
to natural language prompts, generating articulate paragraphs with dazzling speed.
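To make tokens concrete, here is a minimal sketch using the open-source Hugging Face transformers library; the model name and sample sentence are illustrative assumptions rather than examples from this report.

from transformers import AutoTokenizer

# Load a pretrained tokenizer (model choice is an illustrative assumption).
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Split a sentence into tokens: words, sub-words, and punctuation marks.
tokens = tokenizer.tokenize("Language models interpret, summarize, and generate text.")
print(tokens)
# ['language', 'models', 'interpret', ',', 'summarize', ...]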
Vendors such as OpenAI, Google, Microsoft, Hugging Face, and Anthropic offer commercial and open-
source LM tools that converse with humans about a wide array of topics. Employees and mainstream
consumers rushed to experiment with these LM tools after OpenAI released ChatGPT in November
2022. The speed, sophistication, and range of this new technology enabled OpenAI to amass 100 million
active users in two months. These tools are built on so-called “large language models,” which are
trained on high volumes of public data.
Market Overview
To understand the market, let’s examine companies’ different stages of adoption as well as their options
for tackling domain-specific use cases.
Adoption Stages
Companies adopt GenAI in three stages: (1) As standalone LM tools, (2) as LM functions within other tools,
and (3) as LM-driven workflows. With each stage, companies become more dependent on their internal,
domain-specific data—and therefore their own processes for data delivery. Because stages 2 and 3
apply language models to companies’ domain-specific data, Eckerson Group calls these models “small
language models.” To keep things simple, this report uses the umbrella term “language models,” or LMs.
Stage 1: Language model tools. Employees use standalone LM tools as platforms to assist their work.
They might create content, write code, or conduct research on an ad hoc basis using a limited amount
of company data. These tools introduce the risk of hallucinations, privacy breaches, and bias because
vendors train them on public and potentially outdated inputs. Such risks, coupled with a peak in the
“hype cycle” and rising competition from both commercial and open-source LM tools, contributed to a
drop in visitors to ChatGPT’s website in mid-2023.
Stage 2: LM functions in non-LM tools. The second stage poses fewer risks, because software
companies control the usage of LMs. In this stage, employees use LM functions as assistants within a
broad array of commercial applications, such as Einstein within Salesforce, Joule within SAP, and
Copilot within GitHub. In a similar way, data pipeline tools such as Matillion now include LM assistants.
All these applications and tools boost user productivity by automating, explaining, and recommending
actions based on company data and metadata. Some companies might leapfrog to stage 2 adoption,
skipping stage 1.
Stage 3: LM-driven workflows. Companies build LM functionality into the tools, applications, and APIs
that comprise their proprietary workflows. LM-driven workflows consume lots of domain-specific data,
including both traditional tables and unstructured text that companies did not analyze before. These
workflows can do more than just boost productivity; companies implement them to craft sustainable
competitive advantage based on their differentiated activities and data sets. They represent the next
wave of GenAI innovation.
And this wave is underway. In a recent poll of technology professionals, 40% of respondents told
Eckerson Group that their company has LM-driven workflows in pilot or production, with another 33%
planning or considering it (see figure 2).
Figure 2. Poll results on LM-driven workflow adoption: in production, 13%; in pilot, 27%; planning or considering, 33%; no plans, 27%.
Implementation Options
Companies can apply language models to domain-specific use cases in three ways: building a model from scratch, fine-tuning a pre-trained model, or enriching prompts through retrieval augmented generation.
> Building from scratch. This option involves collecting and preparing a corpus of text, then designing
an LM neural network that learns to interpret, summarize, and generate text based on patterns it
identifies in the source data set. Building an LM from scratch requires extensive data science expertise,
high data volumes, many iterations, and lots of compute cycles. Most companies avoid this option
because they lack the funding and expertise to produce anything more effective than what vendors
such as OpenAI, Google, and Microsoft have built already.
> Fine-tuning. Companies can take a pre-trained LM such as ChatGPT from OpenAI or BLOOM from
Hugging Face and fine-tune it to better interpret domain-specific content. With this option, they apply
the LM to their own content, check outputs, and adjust parameters—that is, the many variables that
describe content patterns—to make outputs more accurate over time. Fine-tuning requires a lot of
compute capacity and can prove prohibitively expensive due to the current shortage of AI-friendly
graphics processing units (GPUs).
> Retrieval augmented generation (RAG). Companies also can augment prompts by inserting excerpts
from domain-specific content such as product documentation, customer service records, chapters of
books, and academic articles. RAG asks the LM to find the answer within that trusted content, reducing the
risk of hallucinations or other issues. RAG is known as a type of “grounding” because it creates a foundation
of facts. It costs less money than fine-tuning because it does not require compute-intensive workloads.
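As a minimal illustration of grounding, the sketch below inserts a trusted excerpt into a prompt template; the excerpt and question are invented for illustration, and the retrieval step itself is covered in the Architecture section.

# An excerpt retrieved from trusted, domain-specific content (illustrative).
excerpt = ("Mount the RFID receiver at least two meters above the container "
           "deck and away from large metal surfaces.")
question = "Where should I mount the RFID receiver?"

# Ground the prompt: ask the LM to answer only from the supplied excerpt.
prompt = ("Answer the question using only the context below. If the answer "
          "is not in the context, say you do not know.\n\n"
          f"Context: {excerpt}\n\nQuestion: {question}")
print(prompt)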
This report focuses on data pipelines that support RAG as the most common and viable implementation
option for LM-driven workflows.
Challenges
A few chronic challenges stand in the way of successful GenAI initiatives, in particular those involving
LM-driven workflows. The first is inefficient data delivery, as proliferating users, devices,
applications, data sources, and platforms create silos. This drives complexity because companies
have to build data pipelines and models to integrate the data before applying AI. Second, many data
teams lack the staff and coding skills they need to integrate multi-structured data across siloed
systems—or to prepare unstructured text for LM consumption. This rising complexity leads to the
third challenge: untrustworthy data, including incomplete, outdated, inconsistent, erroneous, or
inaccurate tables as well as easily misinterpreted text files. A fourth and related challenge is inadequate
governance, as companies struggle to create and enforce policies that control data usage, reduce risk,
and ensure regulatory compliance.
Benefits
Companies that overcome these challenges can achieve outsize business benefits. Their workforces
become more productive—consider the sales manager who makes higher-impact pitches to more
prospects each week, or the insurance manager who processes policy claims faster and with fewer
errors. Their operations become more efficient—consider the factory that can add or drop component
suppliers faster to respond to changing market conditions. GenAI also can boost revenue—consider how
these examples increase sales effectiveness, customer satisfaction, and manufacturing capacity.
Perhaps the most important benefit is competitive advantage, which companies primarily achieve during
the third stage of GenAI adoption. These competitive advantages might derive from the high-impact
sales pitch or claims-processing workflows described above. They might derive from new products or
services—consider a furnishing company that conjures unique virtual tours of prospects’ own homes
featuring new furniture, or a service chatbot that improves customer retention by making individual
offers based on real-time sentiment analysis.
In these and other ways, companies across industries have the opportunity to disrupt the competition by
optimizing activities and human interactions with the help of LM-driven workflows.
Requirements
Companies must design LM-driven workflows—and the data pipelines within them—to be automated,
overseen by humans, modular, open, governed, and efficient.
> Automated. The more data teams reduce manual coding and scripting, the better they can improve
efficiency and support rising business demands with existing staff. Templates, schedulers, and
graphical user interfaces (GUIs) help automate data pipeline management.
> Overseen by humans. Machine-learning models and LMs assist and accelerate each stage of pipeline
management. But humans must oversee these assistants and manage them like talented but
inexperienced subordinates. They need expert human guidance, supervision, and approval to ensure
they take the right actions and do not hallucinate.
> Modular. Data teams should manage pipeline elements much like LEGO pieces: standard building
blocks that can form a myriad of combinations. For example, a commercial data pipeline tool can help
users create, use, and reuse elements to form whatever pipelines the business demands.
> Open. Data pipelines and environments need unfettered access to the full ecosystem of commercial
and open source elements that contribute to modern AI/ML initiatives. They need to maintain data
portability and tool interoperability across platforms, processors, formats, libraries, and languages.
> Governed. The risks of GenAI underscore the need for effective governance of data, models, outputs,
and workflows. Governance teams must enforce standards and policies that reduce the likelihood of
hallucinations, privacy breaches, bias, regulatory infractions, or mishandling of intellectual property.
They also must ensure ethical behavior (including safety and fairness) as well as transparency and
accountability.
> Efficient. AI/ML workloads can break budgets by consuming more elastic cloud compute cycles than
expected. Data teams must streamline compute requirements where possible. They also should adopt
FinOps processes to forecast, measure, and account for the consumption of those compute resources.
Architecture
The core architectural elements that contribute to LM-driven workflows are text sources, data pipelines,
vector databases, and custom applications that contain language models (see figure 3). All of them must
operate within a governance framework. Let’s explore each of these in turn.
Figure 3. LM-driven workflow architecture: data objects (text files, etc.) feed a data pipeline (extract, load, transform—reformat, tokenize, chunk, embed—and load) into a vector database (store, search, retrieve), which serves a custom application and its language model (query, enrich, interpret, respond), all within a governance framework.
Data Objects
A variety of unstructured and semi-structured data objects, many containing text, feed the pipelines that
support LM-driven workflows. Applications such as Gmail, Google Docs, Oracle NetSuite CRM, and
the Fathom meeting assistant create and store text that describes the myriad interactions of modern
business. Files in PDF or other formats also contain reports, documentation, and so on. These and
other text sources offer rich material to support LM-driven workflows. Other data types include images,
audio files, or even video clips. The “Additional Elements” section later in this report also explores how
relational databases or graph databases can deliver multi-structured data to LM-driven workflows. This
section focuses on text as the primary example of inputs to the vector database.
Data Pipeline
The data pipeline comprises familiar stages: extraction, transformation, and loading (ETL), each assisted
and automated by the tools cited below. Although the order of these stages can vary, a common pipeline
sequence for LM-driven workloads is extract and load, transform, and load again.
Extract and load. Data engineers and data scientists use pipeline tools such as Matillion or Airbyte to
extract relevant text from applications and files, then load it into a landing zone on platforms such as the
Databricks lakehouse.
Transform. Next, data teams take a series of steps to transform the data and prepare it for LM
consumption: reformatting, tokenizing, chunking, and embedding.
> Reformat. Data engineers or data scientists convert this multi-sourced text to a common format, such
as Delta, JavaScript Object Notation (JSON), or comma-separated values (CSV) for efficient access and
manipulation.
> Tokenize. Natural language processing (NLP) engineers, data scientists, or ML engineers work with
data engineers to convert the text into tokens, using tools such as Matillion or Hugging Face’s BERT
tokenizer. Each token represents a word, character string, or punctuation mark.
> Chunk. Stakeholders also use these tools to divide text into “chunks” or combinations of tokens to help
maintain context and coherence. Chunks are often just sentences or paragraphs from the source text.
> Embed. ML engineers, data scientists, or NLP engineers create “embeddings,” which are numerical
vectors that describe the meaning and interrelationships of chunks. Techniques such as Word2Vec
and GloVe help with this process, as do tools from Matillion, OpenAI, or the LangChain framework.
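A minimal sketch of the chunking and embedding steps appears below. It assumes the open-source sentence-transformers library and an illustrative model; a production pipeline would use the tools named above.

from sentence_transformers import SentenceTransformer

# Chunk: split source text into sentence-sized pieces (illustrative content).
chunks = [
    "Mount the receiver two meters above the deck.",
    "Keep it away from large metal surfaces.",
    "Connect the transponder before powering on.",
]

# Embed: encode each chunk as a numerical vector that captures its meaning.
model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice
embeddings = model.encode(chunks)
print(embeddings.shape)  # (3, 384): one 384-dimension vector per chunk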
Load. Now data engineers, ML engineers, or database administrators load the embeddings into the
vector database, once again using tools such as Matillion to perform the loading.
Vector Databases
Vector databases play a pivotal role in GenAI initiatives because they transform and deliver text inputs
in a readily usable format for consumption by LMs. They enable GenAI applications to enrich prompts
with governed inputs, reduce hallucinations, and generate more trustworthy outputs. The vector
database might be a specialist platform such as Pinecone, Weaviate, or Qdrant, or a zone within a
broader platform such as Redis, SingleStore, or Databricks. The vector database stores, searches, and
retrieves embeddings to support the custom application and its LM.
> Store. The vector database stores and indexes embeddings, capturing all their numerical values to
make them easily searchable.
> Search. The vector database searches through the embeddings based on their semantic similarity to
content within the query it receives from the custom application.
> Retrieve. It finds the most similar embeddings, retrieves their content, and delivers that content to the
custom application.
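The sketch below shows the store, search, and retrieve steps in miniature, using plain NumPy cosine similarity as a stand-in for a real vector database; the toy three-dimension vectors replace the embeddings a pipeline would produce.

import numpy as np

def cosine_similarity(a, b):
    # Semantic similarity between two embedding vectors.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Store: keep each chunk's embedding alongside its text (toy vectors).
store = [
    (np.array([0.9, 0.1, 0.0]), "Mount the receiver two meters above the deck."),
    (np.array([0.1, 0.9, 0.0]), "Keep it away from large metal surfaces."),
]

# Search: rank stored chunks by similarity to the query's embedding.
query_vector = np.array([0.8, 0.2, 0.1])
ranked = sorted(store, key=lambda item: cosine_similarity(item[0], query_vector),
                reverse=True)

# Retrieve: hand the most similar content back to the custom application.
print(ranked[0][1])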
Custom Application
Software developers build a custom application that queries the vector database and enriches prompts
to assist the LM.
> Query. The custom application queries the vector database by asking it to perform the similarity
search described above.
> Enrich. Then the application takes the query results from the vector database and injects them into
the prompt, thereby enriching the prompt.
Language model. Software developers and ML engineers implement the LM within the custom
application.
> Interpret. The LM interprets the prompt, including the content that the application injected into it.
> Respond. It generates a response to the prompt. This response has a lower likelihood of hallucination
because it incorporates existing factual content.
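To tie the query, enrich, interpret, and respond steps together, here is a minimal sketch using OpenAI's Python client; the retrieved chunks stand in for vector database results, and the model name is an illustrative assumption.

from openai import OpenAI

# Content retrieved from the vector database (illustrative stand-ins).
retrieved_chunks = [
    "Mount the receiver two meters above the deck.",
    "Keep it away from large metal surfaces.",
]
question = "Where should I mount the RFID receiver?"

# Enrich: inject the retrieved content into the prompt.
prompt = ("Answer using only the context below.\n\nContext:\n"
          + "\n".join(retrieved_chunks)
          + f"\n\nQuestion: {question}")

# Interpret and respond: the LM answers from the grounded prompt.
client = OpenAI()  # reads the OPENAI_API_KEY environment variable
response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative model choice
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)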
Governance Framework
Companies must design and implement LM-driven workflows within a governance framework that reduces
the risks described earlier. Compliance officers and other governance stakeholders, sponsored by the chief
data officer, create standards and policies to control the access and usage of data. Data stewards then
enforce those policies with the help of stakeholders such as data engineers and ML engineers. This includes
filtering the domain-specific data that feeds into these pipelines and workflows to ensure it represents the
“ground truth” of the business. Governance frameworks include tools such as master data management,
which improves data quality by applying consistent terminology to customers, products, and other
business entities. Data observability tools also improve quality by inspecting and validating data sets.
Additional Elements
As one might expect, the data pipeline, vector database, and custom application do not exist in isolation.
LM-driven workflows might also include relational databases, graph databases, and commercial
applications. Together these elements increase the types of data and therefore the types of use cases
that LM-driven workflows can support. Figure 4 illustrates the additional elements in blue.
Figure 4. Additional elements: users interact through a commercial application, which connects via data pipelines to the custom application and its language model.
> Relational databases. Most LM-driven workflows still need hard numbers that reside in the tables
of relational databases rather than text files. A support workflow, for example, might need the details
of a recent purchase so the LM can converse intelligently with an angry customer. For this reason,
companies integrate SQL query processes (possibly using BI tools) and relational databases into
their LM-driven workflows, as sketched after this list. This integration enables LMs to respond to user
prompts with trustworthy numbers as well as text.
> Graph databases. Some types of data are best suited for a graph database, which uses interconnected
nodes and edges to describe complex interrelationships between various entities. These entities might
include people in a social network, buyers and sellers in a marketplace, or households in an electricity
grid. Graph databases help LM-driven workflows understand such relationships, enabling them to (for
example) detect potential fraud among multiple parties in a financial transaction. Graph databases
help these workflows support many such use cases.
> Commercial applications. Most of these workflows also depend on commercial applications to
interface with customers, employees, and business partners. Such users expect to keep using their
familiar interface—Zendesk, Hubspot, Slack, Asana, Google Workspace, you name it—rather than
having to learn something new. To win early adopters, LM-driven workflows therefore need to integrate
with existing commercial applications.
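Here is a minimal sketch of the relational integration described in the first bullet above, using Python's built-in sqlite3 module; the table, columns, and purchase details are illustrative assumptions.

import sqlite3

# An in-memory database standing in for the company's relational store.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer_id INT, item TEXT, amount REAL)")
conn.execute("INSERT INTO purchases VALUES (42, 'RFID receiver', 1299.00)")

# Query the hard numbers the LM needs for an informed response.
item, amount = conn.execute(
    "SELECT item, amount FROM purchases WHERE customer_id = ?", (42,)
).fetchone()

# Inject the trustworthy figures into the prompt alongside retrieved text.
prompt = (f"The customer recently bought a {item} for ${amount:.2f}. "
          "Draft an empathetic reply to their installation question.")
print(prompt)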
Use Cases
Now we’ll explore three use cases for these data pipelines and LM-driven workflows: customer service,
document processing, and medical research.
Customer Service
Suppose a manufacturer of radio frequency identification (RFID) systems struggles to answer a deluge
of questions from shipping container companies, merchant vessel operators, and railway companies
about how to install its products. This manufacturer’s service team lacks the call support staff and field
technicians it needs to walk users through the process of installing the RFID transponders, receivers, and
transmitters. Recognizing an opportunity to both alleviate current pain and leapfrog the competition, this
RFID manufacturer decides to invest in GenAI.
The chief data officer and head of customer service assemble a cross-functional team that comprises
data scientists, data engineers, ML engineers, software developers, call support reps, and field
technicians. Together these stakeholders implement a data pipeline based on Matillion and an LM-driven
workflow based on the Claude AI assistant from Anthropic. This workflow uses RAG to feed product
documentation from the Weaviate vector database into customer prompts, generating fast and accurate
answers to wide-ranging technical questions from users. By consulting, reviewing, and editing the LM
responses, call support staff can close more service tickets faster. This reduces the need to dispatch field
technicians, which saves money, and it improves customer satisfaction, which increases repeat business.
Document Processing
Now suppose a financial services company struggles to keep up with demand for new mortgages. It
doesn’t have enough service reps and loan officers to process rising applications from the southwestern
United States, a region it expanded into after acquiring a local firm there. The financial services company
knows from several recent hires that its major regional competitor also cannot keep up with demand. As
in our previous example, executives spot an opportunity to both alleviate current pain and leapfrog the
competition with GenAI.
The chief data officer and head of the mortgage division recruit a cross-functional team of data scientists,
data engineers, NLP engineers, and software developers, as well as service reps and loan officers.
They implement a new data pipeline that transforms and loads three years of loan documents—
applications, customer letters, legal contracts, etc.—into a vector database. The vector database, coupled
with a relational database containing historical loan data, enriches prompts and enables the LM to
automatically draft key documents in the loan application process. Service reps and loan officers review
and edit these drafts rather than writing them from scratch. They reduce their backlog, process more
applicants faster, and gain market share.
Medical Research
As a final use case, consider a publishing firm that curates a repository of biomedical and life science
research that spans decades of scholarly articles from thousands of journals. Doctors, faculty, and students
all over the world use this repository to help them practice medicine and research cures to various diseases.
However, general practitioners sometimes misdiagnose unusual conditions because they don’t have the
time to find and review all the scholarly articles that might be relevant to a given patient case. As a result,
they might prescribe ineffective treatment or be slow to refer patients to the right specialists.
This publishing firm’s president, a former practicing MD, decides to address this problem using the power
of GenAI. She collaborates with the chief data officer, data scientist, data engineer, and data steward to
replicate and transform the research repository into a vector database as part of a RAG implementation.
Doctors prompt the language model by asking key questions and copying patients’ anonymized records
into the prompt window. Based on that information, the application queries the vector database and
retrieves relevant articles from the repository. It enriches the prompt with this information, enabling the
language model to summarize findings and recommend key passages to review. Armed with a new level
of detail, a doctor can rapidly diagnose a condition or refer a patient to the right specialist.
Guiding Principles
The GenAI innovation wave is well underway, with companies across sectors starting to implement LM-
driven workflows to increase productivity and sharpen their competitive positions. These workflows span
text sources, data pipelines, vector databases, and custom applications that contain language models,
all operating within a governed framework. Cross-functional teams must design these workflows to be
automated, overseen by humans, modular, open, governed, and efficient.
The following guiding principles can help data and AI leaders increase the odds of success with their
generative AI initiatives:
> Pain signals opportunity. If your company feels pain with certain manual processes—and most
do—turn this weakness into strength. Find and capitalize on opportunities to accelerate business
operations, delight customers, and steal customers away from your rivals.
> Pick your battles. Most companies cannot hope to build a language model with better language
processing capabilities than ChatGPT, Bard, or BLOOM. A more viable option is to fine-tune or prompt
existing LMs with your own domain-specific data.
> Govern. Govern. Govern. Generative AI, like other types of AI, can exacerbate long-standing risks to
data quality, privacy, and handling of intellectual property. Design your data pipelines to minimize
these risks while preparing text for LM consumption.
About Eckerson Group
Eckerson Group is a global research, consulting, and advisory firm that focuses solely on data and
analytics. Our experts specialize in data governance, self-service analytics, data architecture, data
science, data management, and business intelligence.
> Our thought leaders publish practical, compelling content that keeps data analytics leaders abreast
of the latest trends, techniques, and tools in the field.
> Our consultants listen carefully, think deeply, and craft tailored solutions that translate business
requirements into compelling strategies and solutions.
> Our advisors provide competitive intelligence and market positioning guidance to software vendors
to improve their go-to-market strategies.
Our clients say we are hardworking, insightful, and humble. It all stems from our love of data and our
desire to help organizations turn insights into action. We are a family of continuous learners, interpreting
the world of data and analytics for you.
Get more value from your data. Put an expert on your side. Learn what Eckerson Group can do for you!
About Matillion
Matillion is the productivity platform for data teams.
Thousands of enterprises including Cisco, DocuSign, Slack, and TUI trust Matillion to move, transform,
and orchestrate their data for a wide range of use cases from insights and operational analytics, to data
science, machine learning, and AI.
Native integration with popular cloud data platforms such as Snowflake, Databricks, Amazon Redshift,
and Google BigQuery lets data teams at every skill level automate management, refinement, and data
delivery for every data integration need.