Technical Blog
Explore in-depth articles, tutorials, and insights on data analytics and machine learning in the Databricks Technical Blog. Stay updated on industry trends, best practices, and advanced techniques.
Daniel-Liden
Databricks Employee

Introduction: The Origins of Brickbot

If you attended the Databricks Data and AI Summit (DAIS) this year, you may have noticed a new addition to the event app: Brickbot, the summit AI assistant.

Brickbot, the Data and AI Summit Assistant


Brickbot was accessible to all in-person DAIS attendees and could answer questions about sessions, exhibitors, and general conference information, all with a cheery and helpful disposition.

The journey to this point was short but challenging. Two developers spent about eight weeks working on this project. In this post, we’ll describe how we got from the initial idea to the finished product, lessons learned along the way, and what we have planned for the future.

The Idea

Our original plans for Brickbot were very ambitious: we envisioned an AI assistant integrated into every part of the conference. We were excited by the idea of using AI to transcribe and summarize talks on the fly, enabling deep conversations on technical topics informed by the contents of talks, keynotes, and workshops. We wanted the bot to be able to generate maps, create agendas, schedule meetings, and suggest restaurants. We wanted deep personalization, such that the bot would be able to use logged-in users’ profiles to give highly personalized responses.

Of course, we had all of these ideas in the Spring, just a few months before the start of DAIS. After some serious discussions about what was feasible in the time we had (and what could get legal approval in that time), we decided to build a chat application that could:

  • Answer questions about sessions based on titles/abstracts/schedule information, including questions about session contents, times, and locations.
  • Answer questions about exhibitors based on descriptions and location information.
  • Provide general information about DAIS, such as meal times, wifi passwords, and helpdesk locations.
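This scoping decision shapes the whole system: the assistant's first job is to decide which of the three supported areas (if any) a question belongs to, and to refuse everything else. The keyword router below is a deliberately simple, hypothetical sketch of that routing shape; the keyword lists and function names are invented for illustration, not taken from Brickbot's actual classifier.

```python
# Hypothetical sketch: route each question into one of the three supported
# areas, or to a refusal path. Keywords and names are illustrative only.
SCOPES = {
    "sessions": ["session", "talk", "keynote", "abstract", "speaker"],
    "exhibitors": ["exhibitor", "booth", "vendor", "expo"],
    "general": ["wifi", "lunch", "meal", "helpdesk", "registration"],
}

def route(question: str) -> str:
    q = question.lower()
    for scope, keywords in SCOPES.items():
        if any(k in q for k in keywords):
            return scope
    return "out_of_scope"  # politely refuse rather than guess
```

In a real system the router would be a model-based intent classifier, but the contract is the same: every question lands in exactly one handler, and "out of scope" is a first-class outcome rather than an invitation to improvise.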

Success Criteria

Even this narrowed scope wasn’t without risks. Especially now that people are getting used to them, AI assistants can frustrate just as easily as they can delight. If the bot gave inaccurate or unhelpful information with any regularity, it would be a net negative experience for users. If it could be easily induced to give harmful or toxic responses, we wouldn’t be able to deploy it at the conference. The last thing we would want is people on Twitter sharing images of the chatbot bullying attendees! Even more benign issues were cause for concern—the bot would provide a poor experience if it was too slow, or if its responses were too boring.

With these points in mind, we settled on the following success criteria. If we could achieve these, we would feel comfortable deploying and sharing the application. If not, we would pull the plug, and the world would never meet Brickbot.

  • Correctness: Brickbot must supply correct answers without hallucinating, and must (politely) refuse out-of-scope questions rather than fabricate answers.
  • Harmlessness: Brickbot must refrain from unkind or harmful answers. This includes circumstances where users are trying to induce the model to behave in ways outside of the intended scope.
  • Successful retrieval: Within the scope of sessions, exhibitors, and general information, Brickbot must be able to retrieve the right information about the conference with a high degree of accuracy.

We were ultimately able to reach the point where we felt we had satisfied these criteria, based on reviewing numerous internal chat sessions. We’ll cover some of the lessons learned along the way in a later section. But first—what happened once real users started using Brickbot at DAIS?

Brickbot at DAIS

Attendees at DAIS had thousands of conversations with Brickbot on topics including food and beverages, sessions, training and certification, and getting around the conference center. It generally handled in-scope questions well and refused to answer questions outside its areas of knowledge.

Types of questions DAIS attendees asked Brickbot

We ran into a few unexpected use cases and surprises:

  • Wifi password: A lot of people asked Brickbot for the conference wifi information. This wasn’t an issue—Brickbot had access to the wifi details—but it also wasn’t a use case we were expecting (also, the wifi details were on each attendee’s badge!). There were also a lot of questions about food and beverages and about getting around the conference center. We put a lot of our focus into improving retrieval and responses about sessions and may have underestimated the value of providing assistance with simply navigating the conference experience. Still, Brickbot handled these questions well.
  • Keynote Questions: Keynote announcements are among the most exciting parts of DAIS. And, judging from the questions Brickbot received, attendees were very interested in the announcements and wanted more information. Unfortunately, Brickbot did not have any information about the announcements and could not successfully answer these questions.

    For example, one of the major announcements was about serverless compute on Databricks. Brickbot had no specific information about this Databricks feature, so it could only respond with generic remarks about serverless technologies, which wasn’t helpful in the context of the keynote announcements.
  • Jailbreaking attempts: There were some attempts to “jailbreak” Brickbot in order to change its behavior or force it to reveal its “secrets.” First of all, Brickbot had no secrets: there was no non-public information it could have provided. Second, by reviewing the conversation logs, we were able to modify the relevant prompts and block any jailbreak attempts that initially succeeded. Ultimately, these attempts had almost no potential to cause harm.
  • Multilingual questions: Users asked Brickbot questions in multiple languages. We did not explicitly build Brickbot with multilingual capabilities, but the model seemed to perform well regardless.

Reflecting on Brickbot: Lessons Learned

  • Clarity about scope enables focused work: There were many, many features we wanted to build into Brickbot. Some of them would have been more fun to work on than the nuts and bolts of good retrieval, reliability, and observability. But these features would not have meant much if the core functionality was missing. Setting clear and realistic goals let us focus on the parts that really mattered and let us develop a reliable version of Brickbot on time.
  • Real-world usage comes with surprises: Between the high prevalence of questions on wifi details, the jailbreaking attempts, and the non-English inputs, we saw plenty of uses we had not explicitly prepared for. Monitoring real-world usage is key for fixing any critical issues on the fly and for helping to iterate on future versions of the application.
  • Visibility into each component of a compound AI system is essential: We saw a lot of bad responses when we were developing and testing Brickbot. In order to fix them, we had to figure out where the issues came from. Did the retrieval system return the wrong information? Did the intent classifier misunderstand the question? Did the model give a bad response despite having access to the right information? To be able to answer these questions, we needed to be able to see how each component of the system behaved. We used MLflow tracing to record each step of the process of generating a response, enabling us to identify where any failures occurred.
  • Semantic search isn’t a magic bullet: In much of the online discourse around retrieval-augmented generation (RAG), the “retrieval” step is assumed to involve semantic search with a vector database. Our earliest Brickbot prototypes used semantic search. We found that the retrieval was quite poor, given the types of questions people were likely to ask. It wasn’t successful at identifying sessions by specific speakers or about specific technologies. We were ultimately able to get much better retrieval results using a combination of fuzzy text matching, query rewriting, and BM25 keyword search.
  • Modular design improves iteration velocity: As a compound AI system, Brickbot had numerous components, such as an intent classifier, a retrieval system, and the final response generation model. We often needed to change individual components or even swap components out entirely (such as by replacing semantic search with keyword search). Designing Brickbot in a modular way made it easier to iterate quickly: we could update individual components without needing to make sweeping changes to each of the other components. For example, we could change the model responsible for reading the retrieved information and generating the final response without needing to substantially revise any other components of the application.
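The retrieval and observability lessons above can be sketched together in a few dozen lines. This is a simplified illustration, not Brickbot’s actual code: the session titles are invented, the BM25 class is a textbook Okapi implementation (the post doesn’t specify which BM25 variant was used), and the `traced` decorator is a stdlib stand-in for MLflow tracing, which is what actually recorded each component’s behavior in Brickbot.

```python
# Sketch only: BM25 keyword retrieval over session titles, with a minimal
# tracing decorator standing in for MLflow tracing. All data is hypothetical.
import functools
import math
import time
from collections import Counter

TRACE = []  # stand-in for MLflow tracing spans

def traced(name):
    """Record a component's inputs, output, and latency, so a bad response
    can be attributed to a specific step (retrieval, generation, etc.)."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            out = fn(*args, **kwargs)
            TRACE.append({"span": name, "inputs": args, "output": out,
                          "ms": (time.perf_counter() - start) * 1000})
            return out
        return wrapper
    return deco

def tokenize(text):
    return text.lower().split()

class BM25:
    """Plain Okapi BM25 keyword ranking, with the usual k1/b defaults."""
    def __init__(self, docs, k1=1.5, b=0.75):
        self.k1, self.b = k1, b
        self.docs = [tokenize(d) for d in docs]
        self.N = len(self.docs)
        self.avgdl = sum(len(d) for d in self.docs) / self.N
        self.tf = [Counter(d) for d in self.docs]
        df = Counter(t for d in self.docs for t in set(d))
        self.idf = {t: math.log(1 + (self.N - n + 0.5) / (n + 0.5))
                    for t, n in df.items()}

    def score(self, query, i):
        dl, s = len(self.docs[i]), 0.0
        for t in tokenize(query):
            f = self.tf[i].get(t, 0)
            if f == 0:
                continue
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            s += self.idf[t] * f * (self.k1 + 1) / norm
        return s

# Hypothetical session titles -- not the real DAIS catalog.
SESSIONS = [
    "Serverless compute deep dive",
    "Fine-tuning LLMs on Databricks",
    "Lakehouse monitoring in practice",
]
bm25 = BM25(SESSIONS)

@traced("retrieve_sessions")
def retrieve_sessions(query, top_k=1):
    ranked = sorted(range(len(SESSIONS)),
                    key=lambda i: bm25.score(query, i), reverse=True)
    return [SESSIONS[i] for i in ranked[:top_k]]
```

Because each component is wrapped independently, swapping the retriever (say, replacing semantic search with this keyword ranker) leaves the tracing and the rest of the pipeline untouched, which is the modularity point above in miniature.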

Looking Ahead

We made Brickbot with a very small team on a tight timeline. We’re happy with what we accomplished, but there is a lot more we would like to do, given more resources and the lessons we learned from this first version. We envision a Brickbot that is more informed (e.g., with real-time information about keynote announcements) and more integrated (with access to other information from the DAIS mobile app, possibly including personalized information about logged-in users).

We also want to share more about our experience building Brickbot, showing how the Databricks platform can simplify the development of AI applications like Brickbot.

Lastly, we want to help users with developing their own applications like Brickbot! Get in touch or leave a comment if you are working on a relevant project, and let’s compare notes. Stay tuned for more on how we’ll use the learnings from Brickbot to make it as easy as possible to build AI applications on Databricks.