You Can See The Specialist Now

I’m increasingly convinced that the future of conversational AI is a mixture of models: general (foundational) large language models (LLMs) that provide a high-level diagnosis of a situation or question, and specialized LLMs to which they delegate deeper reasoning. The general LLM processes generic language, orchestrates calls to specialized services and LLMs with deep domain knowledge, and then potentially summarises and synthesises the results back into a general form for the end user.
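To make the delegation idea concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the specialist functions, the `diagnose` keyword matcher, and the `answer` helper are hypothetical stand-ins, and a real system would replace each of them with a call to an actual model or service.

```python
from typing import Callable, Dict, Tuple

# Hypothetical specialists. In a real system each would call a
# domain-specific LLM or service; here they only echo the question.
def legal_specialist(question: str) -> str:
    return f"[legal] analysis of: {question}"

def medical_specialist(question: str) -> str:
    return f"[medical] analysis of: {question}"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "legal": legal_specialist,
    "medical": medical_specialist,
}

# Assumption: keyword matching stands in for the general LLM's
# high-level diagnosis. It is not how a real model would route.
KEYWORDS: Dict[str, Tuple[str, ...]] = {
    "legal": ("contract", "lease"),
    "medical": ("symptom", "headache"),
}

def diagnose(question: str) -> str:
    """Return the domain that should handle the question, or 'general'."""
    lowered = question.lower()
    for domain, words in KEYWORDS.items():
        if any(word in lowered for word in words):
            return domain
    return "general"

def answer(question: str) -> str:
    """Diagnose, delegate to a specialist if one exists, then summarise."""
    domain = diagnose(question)
    specialist = SPECIALISTS.get(domain)
    if specialist is None:
        # No specialist is registered, so the general model answers directly.
        return f"[general] answer to: {question}"
    detail = specialist(question)
    # Stand-in for the general LLM summarising the specialist's result.
    return f"Summary (via {domain} specialist): {detail}"

if __name__ == "__main__":
    print(answer("I have had a headache for three days"))
    print(answer("Can you review my lease contract?"))
    print(answer("What is the capital of France?"))
```

The shape is the point rather than the details: one general entry point decides who should handle the question, the specialist does the deep work, and the general layer turns the result back into something readable for the end user.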

Breaking the Language Barrier: Why Large Language Models Need Open Text Formats

Foundational LLMs are trained on huge corpora of text collected from the public Internet, including websites, books, Wikipedia, GitHub, academic papers, chat logs, Enron emails (!), and so on. One of the better-known public collections of training data is The Pile, an 800 GB dataset of diverse text for language modelling.

In this article I will examine how the training sets for LLMs should influence your choice of data formats, and I will suggest best practices for data formats that LLMs can generate.

