My main business is cloud security education, and I have been associated with the Cloud Security Alliance for more than a decade now.
Currently, the body of knowledge behind their Certificate of Cloud Security Knowledge (CCSK) is being upgraded to version 5. I was involved in that work, and it got me thinking about how large language model (LLM) technology could address a couple of needs.
Let’s start with the genesis of the version 5 draft that we got to work with. It was the product of hundreds of volunteers who offered suggestions on how to improve the base document. What happened was that most people focused on adding what they thought was relevant content, but that content did not always end up in the right place. The process also led to a lot of duplication, and few people occupied themselves with eliminating or reorganising material.
De-duplicate
As a result, I found myself in need of tools to de-duplicate the content. And since the same concept may be expressed in different ways, semantic search is called for. I was already familiar with ‘Retrieval-Augmented Generation’ (RAG) and chatbots, so I thought I’d give that a shot. In good cloud computing tradition, I did some research on build versus buy. I found a couple of decent tools online: ChatPDF.com and Poe.com. They were useful, but they have limitations that I may talk about later. Through demos and tutorials on the LangChain website, I managed to pull together my own version of such a chatbot, ingesting the draft CCSKv5 body of knowledge. For one, this gives me more insight into what is going on under the hood of a RAG chatbot. It also gives me more control over the process.
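To give a taste of what semantic de-duplication can look like, here is a minimal sketch that embeds paragraphs and flags pairs with high cosine similarity. It assumes the sentence-transformers library; the model name, the threshold, and the sample paragraphs are my assumptions, not a tested recipe.

```python
# Minimal sketch: flag near-duplicate paragraphs via sentence embeddings.
# Model choice and threshold are assumptions; tune both on a labelled sample.
from itertools import combinations

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

# In practice these would be the paragraphs of the draft body of knowledge.
paragraphs = [
    "Encrypt data at rest using keys managed by the customer.",
    "Data at rest should be encrypted with customer-managed keys.",
    "Incident response plans must be tested annually.",
]

# Normalised embeddings make cosine similarity a simple dot product.
embeddings = model.encode(paragraphs, normalize_embeddings=True)
similarity = util.cos_sim(embeddings, embeddings)

THRESHOLD = 0.85
for i, j in combinations(range(len(paragraphs)), 2):
    if similarity[i][j] >= THRESHOLD:
        print(f"Possible duplicate ({similarity[i][j]:.2f}): paragraph {i} <-> {j}")
```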
Did it help me edit the documents? Yes, though there is room for improvement.
As I write this, I am still tinkering with it. Let’s just briefly summarize the first experiments and open issues.
Split
RAG works by chunking text into smaller pieces. ChatPDF and Poe don’t seem to take the document structure of headings and subheadings into account, which in the body of knowledge carries significant semantic relevance. They just chunk the text while trying to keep some of the sentence structure intact (e.g. LangChain’s RecursiveCharacterTextSplitter). There is also a Markdown splitter, so I tried that. It should have the advantage that you can annotate answers with the actual section header, which I haven’t gotten around to yet. How much the Markdown splitter improves results remains to be seen.
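For the curious, here is roughly what the two approaches look like in LangChain. This is a sketch: the chunk sizes are arbitrary, the filename is hypothetical, and the import path may differ between LangChain versions.

```python
# Sketch of structure-blind vs. structure-aware splitting in LangChain.
# Import path may differ across versions (older: langchain.text_splitter).
from langchain_text_splitters import (
    MarkdownHeaderTextSplitter,
    RecursiveCharacterTextSplitter,
)

md_text = open("ccskv5_draft.md").read()  # hypothetical filename

# Structure-blind: keep sentences and paragraphs intact where possible.
plain_chunks = RecursiveCharacterTextSplitter(
    chunk_size=1000, chunk_overlap=100
).split_text(md_text)

# Structure-aware: every chunk carries its heading path as metadata,
# which is exactly what you need to annotate answers with section headers.
header_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[("#", "domain"), ("##", "section"), ("###", "subsection")]
)
md_chunks = header_splitter.split_text(md_text)
print(md_chunks[0].metadata)  # e.g. {'domain': '...', 'section': '...'}
```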
In the next step of RAG, relevant chunks of the source text are pulled (retrieved) out of a so-called vector database. I noticed in some of the experiments that this retriever isn’t always accurate on acronyms and abbreviations; it misses quite a few. I can understand that as a side effect of the sentence embedding process, but it is a loss. To address it, hybrid search, which combines keyword matching with vector search, looks worth trying. Again: experiments to perform.
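A sketch of what that hybrid search could look like: a keyword retriever (BM25), which matches acronyms literally, combined with the vector retriever. The weights and k values are guesses to be tuned, and I assume the md_chunks from the splitter sketch above plus an OpenAI API key.

```python
# Sketch of hybrid retrieval: BM25 catches literal acronym matches,
# vector search catches paraphrases; the ensemble merges both rankings.
# Requires the rank_bm25 and faiss packages; weights and k are guesses.
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings

# md_chunks: the structure-aware chunks from the splitter sketch above
keyword = BM25Retriever.from_documents(md_chunks)
keyword.k = 4

vector = FAISS.from_documents(md_chunks, OpenAIEmbeddings()).as_retriever(
    search_kwargs={"k": 4}
)

hybrid = EnsembleRetriever(retrievers=[keyword, vector], weights=[0.4, 0.6])
docs = hybrid.invoke("What does the body of knowledge say about CASB?")
```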
Comparing the results
The most pressing need I feel right now is to set up a (semi-)automatic system for judging the quality of responses and comparing them across variations of the RAG pipeline.
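I have not built this yet, but the shape I have in mind is an LLM-as-judge loop over a fixed question set. Everything here is a placeholder: the model, the rubric, and the pipeline variants (ask_recursive and ask_hybrid are hypothetical stand-ins, not real functions).

```python
# Rough sketch of a semi-automatic judging loop; the pipeline functions
# (ask_recursive, ask_hybrid) are hypothetical stand-ins for RAG variants.
from langchain_openai import ChatOpenAI

judge = ChatOpenAI(model="gpt-4o", temperature=0)

questions = [
    "Where does the body of knowledge address workload isolation?",
    "What does it say about shared responsibility?",
]
pipelines = {"recursive": ask_recursive, "markdown+hybrid": ask_hybrid}

for q in questions:
    for name, ask in pipelines.items():
        answer = ask(q)
        verdict = judge.invoke(
            f"Question: {q}\nAnswer: {answer}\n\n"
            "Score 1-5 for correctness and for pointing to where the body of "
            "knowledge covers this. Reply as: score, one-line justification."
        )
        print(f"[{name}] {q} -> {verdict.content}")
```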
In the end, for this use case, we want a chatbot that tells us, for each question, where it is addressed in the body of knowledge (a sketch of this follows below). Another use would be to have a concept explained to a student. I have let some of my GenZ students loose on this, and they seem to like it. Peculiarly, asking the same questions of a general chatbot, one not fed with our body of knowledge, yields almost the same answers. ChatGPT and friends are pretty good at cloud security!
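Tying the pieces together, something like the following could answer a question and report where the body of knowledge addresses it, using the heading metadata that the Markdown splitter attached to each chunk. Again a sketch, assuming the hybrid retriever from above; newer LangChain versions may prefer a different chain construction.

```python
# Sketch: answer a question and report which sections address it,
# using the heading metadata the Markdown splitter attached to each chunk.
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

qa = RetrievalQA.from_chain_type(
    llm=ChatOpenAI(model="gpt-4o", temperature=0),
    retriever=hybrid,  # the ensemble retriever from the previous sketch
    return_source_documents=True,
)

result = qa.invoke("Where is key management covered?")
print(result["result"])
for doc in result["source_documents"]:
    print("Addressed in:", doc.metadata)  # e.g. {'domain': ..., 'section': ...}
```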
These student conversations are just a start in the arsenal of learning tools. I am also looking at generating learning cases and multiple-choice questions. Initial experiments give promising results (a sketch closes this post). Beyond this, perhaps more important and certainly more challenging, is using LLM technology to uncover a text’s hidden assumptions about prerequisite knowledge that students should have, but don’t. The best I have for that now is a laborious exercise of inquiry with actual GenZ students. There must be a better way.
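To close, here is the flavour of the multiple-choice experiment. The prompt wording is mine and untested beyond a few runs; treat it as a starting point rather than a recipe.

```python
# Sketch: generate one multiple-choice question from a chunk of the
# body of knowledge; model and prompt wording are illustrative only.
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.7)

chunk = md_chunks[0].page_content  # a chunk from the splitter sketch above
prompt = (
    "You are a cloud security instructor. From the passage below, write one "
    "multiple-choice question with four options. Mark the correct answer and "
    "explain briefly why each distractor is wrong.\n\nPassage:\n" + chunk
)
print(llm.invoke(prompt).content)
```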