There is immense potential for generative artificial intelligence (AI) to assist in the document analysis of large data sets. As a proof of concept, I combined the Substantial Equivalence (SE) Final Rule document with several SE reviewer guides and scientific policy memoranda related generally to FDA review of tobacco product applications and specifically to SE Reports. I chose this data set because the SE review process at CTP is more mature and more information about it is publicly available to analyze. There have also been many more marketing authorizations through the SE pathway, so the underlying data should, in theory, be more "trainable." After cleaning up the combined document by removing redacted pages and OCR'ing certain text elements, I used petal.org's document analysis platform (free version) to upload, extract, and process the ~455-page document (1.3M characters). From my current (and limited) understanding, this involves a process called "chunking," in which a long document is split into smaller (often overlapping) passages that can be indexed and processed efficiently.
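To make that concrete, here is a minimal sketch of fixed-size chunking in Python. Everything in it (the file name, chunk size, and overlap) is my own assumption for illustration, not a detail of petal.org's actual pipeline, which I have no visibility into.

```python
# Minimal illustration of "chunking": splitting a long document into
# smaller, overlapping windows so each piece fits a model's context.
# Chunk size and overlap are arbitrary values chosen for illustration.

def chunk_text(text: str, chunk_size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping character windows."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# "se_combined_document.txt" is a placeholder for the cleaned-up,
# combined ~455-page document (~1.3M characters).
with open("se_combined_document.txt", encoding="utf-8") as f:
    document = f.read()

chunks = chunk_text(document)
print(f"{len(document):,} characters -> {len(chunks)} chunks")
```

Each chunk can then be embedded and indexed so that a given prompt retrieves only the most relevant passages rather than the entire document.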
At this point, so many new AI and natural language processing (NLP, a subfield of AI) technologies and techniques are being released almost weekly, both open-source and proprietary, that it is basically impossible to keep track of them all. My focus for this exercise was to leverage currently available technology to do a specific job: quickly analyze a large public data set to answer questions about SE, so I don't have to spend time repeatedly re-reading reviewer guides, reference documents, and final rules in the course of my work. I was in the SE trenches for many years and would've greatly appreciated a tool like this.
To be sure, this technology goes way beyond a CTRL+F keyword search, and I'm consistently impressed with the output it generates, but information is not knowledge. There is an art and a science to asking the right questions (prompts) to get the information you want, and it takes a substantial amount of background knowledge to use these tools effectively and confirm that the output is generally correct. I curated a list of 10 prompts for this document analysis, listed below. To quantify the time savings: answering these 10 questions by reading the documents and typing responses of the same quality and depth would have taken me about an hour; the entire prompt-response sequence with this tool took less than 5 minutes in total. As noted, this was a proof-of-concept exercise, kind of like conversing with the document via chatbot, so I created prompts general enough to provide relevant information for someone new to the data. The resulting output could serve as a new-hire onboarding document, a reference for regulatory subject matter experts, or a glossary of terms for an application.
I was concerned about repeatability of the output, so I entered the same prompts in different sessions. The responses did not change much, likely because this was a constrained document analysis task, closer to traditional NLP than to open-ended generation with the broader AI models (OpenAI's probably being the most well-known) that are becoming more familiar but are still a bit hallucinatory. Building quality data models requires good curation and computing power. Future directions include better methods for extracting data tables from PDFs (HPHC and ingredient data for SE and PMTA) as well as running local large language models (LLMs) for more precision and better data security around proprietary information; a rough sketch of the table-extraction idea follows this paragraph.
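As a sketch of that table-extraction direction, an open-source library such as pdfplumber could pull candidate tables out of a PDF. The file name below is a placeholder and the code is untested against real SE or PMTA documents; it shows the general shape of the approach, not a working pipeline.

```python
# Rough sketch: pulling data tables (e.g., HPHC or ingredient tables)
# out of a PDF with the open-source pdfplumber library.
# "se_report.pdf" is a hypothetical file name.
import pdfplumber

tables = []
with pdfplumber.open("se_report.pdf") as pdf:
    for page in pdf.pages:
        # extract_tables() returns each table as a list of rows,
        # where each row is a list of cell strings (or None).
        tables.extend(page.extract_tables())

print(f"Found {len(tables)} candidate tables")
if tables:
    for row in tables[0][:5]:  # peek at the first rows of the first table
        print(row)
```

In practice, scanned pages would still need OCR first, and any extracted rows would need to be validated against the source document before being fed to a model.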
1. Explain the substantial equivalence process.
2. Provide information related to how a toxicology reviewer assesses an application. What are the most important considerations?
3. Provide an overview of the information used to support social science review of SE Reports.
4. Provide a summary of the review used to support behavioral and clinical pharmacology for SE Reports.
5. Explain equivalence testing for SE Reports.
6. What is a surrogate tobacco product?
7. Provide an overview of stability testing for smokeless tobacco products.
8. Is clinical data needed in order to review an SE Report?
9. Explain what a different question of public health means.
10. What are 2 of the deficiencies most commonly seen in SE Reports?