Header Menu
G. N. Shah December 14, 2020 No Comments

Incorporating NLP Capabilities into Innovatix’s WebMineR

Executive Summary Innovatix Technology Partner’s new product, WebMineR, is a state-of-the-art web scraping cloud application. It has over 25 best-in-class capabilities, and is highly scalable, efficient, secure and configurable. The list of major features continues to grow, as shown in the latest Roadmap posted on our web site. In addition, several recent papers on Innovatix’s web site discuss in detail these 25 major capabilities, and how they compare to other products in the marketplace. This paper starts a dialogue on some of the newest functionality to WebMinerR which are centered in the area of NLP, specifically text summarization and topic modeling. These are clearly important features to have in a web scraping technology. We expect these capabilities to be fully available by second quarter 2021. The purpose of this paper is to describe the research into NLP we are now doing to make sure these capabilities perform well when delivered in WebMineR. Text summarization and topic modeling are two of the most prominent use cases in the field of Natural Language Processing (NLP). Text summarization allows one to understand the basic idea of a block of text without having to manually read and summarize the document. Topic modeling on the other hand provides the ability to extract main topics from a block of text and thus get the key ideas of the paper, again without manually reading it. That is at least the concept behind these two important NLP capabilities. Certainly, the technologies are getting better with time, as AI/NLP continues to move forward at a tremendous pace. How well they actually work today for text extracted from a wide range of web sites (in many different languages) is still an open issue in our minds. For sure it is critical to know how to best configure and use the NLP algorithms we select to use in these two areas in order to get the best possible out of them. Hence, that is the goal of our current research in this area and below we report on some of our findings to date. We are doing this testing and research in order to ensure they are reliable when we incorporate them into our web scraping system. WebMineR is built to scrape public information off web sites at ultra-high throughput capacity. The addition of text summarization and topic modeling to the tail end of the WebMineR process would be an obvious major benefit to our user community.   This paper contains a detailed background on both these subjects – text summarization and topic modeling – along with some results of trials and case studies of different off-the-shelf NLP systems and algorithms we have tried out so far. We expect to continue this testing and research through the end of the year. The tests we show here are on healthcare-related documents. In addition to English, we show how these capabilities can be applied to Mandarin language documents as well. Overall, our research and testing results show reasonably good performance using NLP methods to extract summaries and topics from documents. We feel strongly this will add useful new capabilities to WebMineR. Stay tuned! Text Summarization Text summarization is the task of creating a concise and accurate summary that represents the most critical information and has the basic meaning of a longer document’s content. It can reduce reading time, accelerate the process of researching for information, and increase dramatically the amount of information found within an allotted time frame. Text summarization can be used for various purposes such as medical cases, financial research, and social media marketing. Text summarization can broadly be divided into two categories: Extractive summarization and abstractive summarization. Resources/libraries that are commonly used for text summarization include: Natural Language Toolkit (NLTK) and Spacy, and pre-trained models like Bert and T5. NLTK and Spacy are open source libraries used for NLP on Python. Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing pre-training developed by Google. BERT is pre-trained on a large corpus of un-labelled text, including the entire Wikipedia(2,500 million words) and Book Corpus (800 million words). You can fine-tune it further by adding just a couple of additional output layers to create state-of-the-art models for your specific text mining application. T5 is an encoder-decoder model and converts all NLP problems into a text-to-text format, and it is trained on a mixture of unlabeled text (C4 collection of English web text). In our test use cases, we used NLTK, Spacy, and BERT for extractive summarization, and we used T5 for abstractive summarization. We applied each of the 4 tools to implement text summarization. Here is an example that shows the summaries that we generated using the different tools. We first load a text file (PDF format) and then applied the four text summarization methods. Text summarization by using NLTK library first uses Glove to extract words embedding and then uses cosine similarity to compute the similarity between sentences and apply the PageRank algorithm to get the score for each sentence. Based on the sentence scores it put together the top-n sentences as a summary. Spacy library will tokenize the text, extract keywords, and then calculate the score of each sentence based on keyword appearance. Text summarization using pre-trained model BERT will first embed the sentences and then run the clustering algorithm to finally find the sentences closest to the centroids and use those sentences as the summary. T5 converts all NLP problems into a text-to-text format, and the summarization is treated as a text-to-text problem. The model we used is called T5ForConditionalGeneration, and it loads the T5 pre-trained model to extract a summary. The document we used to implement text summarization is a summary of the European public assessment report (EPAR) for the drug Fosavance. It explains how the Committee for Medicinal Products for Human Use (CHMP) assessed the medicine to reach its opinion in favor of granting a marketing authorization and its recommendations on the conditions of use for Fosavance. After analyzing the 4 summaries, the