An Updated Text Analytics Primer: Key Factors in a Text Analytics Strategy
About 90% of all the data in the world was created in the last 24 months, at an average of 2.5 quintillion bytes per day, and about 90% of that is unstructured data such as texts, Tweets, pictures, and videos (Griffith, 2018). The goal of discovering meaning and purpose in this electronic torrent has created the industry of text analytics (a.k.a. text mining).
This article is designed for decision makers interested in the benefits of text analytics and looking for key considerations in how to form a high-level strategy. The article provides a concise overview of major points and decisions in a 15-minute brief instead of taking the time to comb through and understand hundreds of pages of textbooks and journal articles. Think of this article as a “cheat sheet” for decision makers outside of, or new to, data science to quickly get up to speed about key issues in text analytics.
The term “text analytics” came into use in the mid-2000s; however, its predecessor, “text mining,” had been in use for decades and appears to have originated in its modern computational form in the early 1980s in the intelligence community and life-sciences industry (Hobbs, 1982). The origins of text analytics, however, are older by at least a century. Thomas Mendenhall, a physics professor at Ohio State and Tokyo University, used statistical methods to analyze curves of composition characteristics in a paper published in the journal Science in March of 1887 (Mendenhall, 1887). Claude Brinegar analyzed the writings of Quintus Curtius Snodgrass in March of 1963 in the American Statistical Association Journal to prove it was a pseudonym for Mark Twain, itself a pseudonym for Samuel L. Clemens (Brinegar, 1963). And Frederick Mosteller of Harvard University and David Wallace of the University of Chicago used a Naïve Bayes statistical model with Poisson and negative binomial distributions in June 1963 to analyze which of the US founding fathers, Alexander Hamilton or James Madison, wrote passages of the Federalist Papers (Mosteller, 1963) (Madigan, 2019).
The popularity of text analytics grew largely alongside the ubiquity of email and spam, because it was, and is, commonly used to automatically detect which messages are probably unwanted solicitations. The need for text analytics had a second explosion with the advent of smartphones and texting, and a third with the advent of social media. As billions of users began posting to Facebook, Twitter, Instagram, YouTube, and other platforms, they generated massive quantities of data, the vast majority of it unstructured and much of it text based.
What is text analytics?
Earlier statistical methods and rules-based AI depended largely upon quantitative fields (numbers) and statistical models to calculate descriptive statistics and roughly model the real world with regressions, Bayes’ theorem, etc. Free-form text does not fit that mold. Therefore, the questions businesses and other organizations would like to answer about the facts, relationships, and sentiments in these text blocks are largely locked and inaccessible for analysis without text analytics.
Text analytics attempts to understand the text itself, identify or categorize the author(s), or connect the text to something tangible in the world, or an event. This means the processing of text analytics can, but does not necessarily, involve computational linguistics, content analysis, information retrieval, and natural language processing. Much of text mining involves breaking text into fragments or symbols (think key words) and applying weights to them, which are often determined by how many times a word or phrase is used in a passage compared to how often it normally appears. Other features often include the length of words and their functions (Madigan, 2019).
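For readers who want to see the weighting idea in concrete form, the following is a minimal sketch of one common scheme, TF-IDF (term frequency multiplied by inverse document frequency). It is offered as an illustration of "how often a word is used in a passage compared to how often it normally appears," not as the method any particular product uses. The tokenizer, the function names, and the three sample sentences are assumptions made up for this example.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters; a deliberately crude tokenizer."""
    return re.findall(r"[a-z']+", text.lower())

def tf_idf(docs):
    """Return one {term: weight} dictionary per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # Share of this passage taken by the term, scaled by its rarity overall.
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return weights

# Hypothetical sample passages, invented for illustration.
docs = [
    "the service was slow and the staff was rude",
    "the food was great and the staff was friendly",
    "the delivery was slow",
]
weights = tf_idf(docs)
# Words that appear in every passage ("the", "was") get a weight of zero;
# rarer words such as "rude" get the highest weights.
```

In this toy corpus, "rude" outweighs "slow" in the first passage because "slow" also appears in another passage, which is the intuition behind down-weighting common words.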
Arguably, the area of greatest interest in text analytics, and the one developing fastest with the biggest impact, is sentiment analysis. It helps categorize authors by their words and behaviors to enable customized marketing, persuasion, or treatment. Moreover, if one can measure sentiment from text fields, then one can also measure and predict shifts in sentiment (more on this later).
Traditionally, statistical methods, models, and tools were used to classify text in a binary way (good/bad). These included discriminative models such as linear and logistic regression and classification and regression trees (CART), and generative models such as Bayesian classifiers and linear discriminant analysis (LDA) (Madigan, 2019). With the advent of inexpensive and massive computing power and storage in the late 2000s, machine learning models began to be more commonly used beginning with supervised machine learning models such as support vector machines (SVM).
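To make the binary (good/bad) classification idea tangible, here is a minimal sketch of one of the generative approaches mentioned above: a multinomial Naïve Bayes classifier with add-one smoothing, written from scratch. The class name, the whitespace tokenization, and the four training sentences are assumptions invented for this example; a production system would use far more data and a tested library.

```python
import math
from collections import Counter, defaultdict

class NaiveBayesText:
    """Tiny multinomial Naive Bayes text classifier with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)          # how many texts per label
        self.word_counts = defaultdict(Counter)      # word frequencies per label
        for text, label in zip(texts, labels):
            self.word_counts[label].update(text.lower().split())
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        n_texts = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, count in self.class_counts.items():
            # Start from the log prior, then add a log likelihood per known word.
            score = math.log(count / n_texts)
            total = sum(self.word_counts[label].values())
            for word in text.lower().split():
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                score += math.log(
                    (self.word_counts[label][word] + 1) / (total + len(self.vocab))
                )
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Hypothetical labeled examples, invented for illustration.
texts = [
    "great product works well",
    "love it great value",
    "terrible product broke fast",
    "awful waste of money",
]
labels = ["good", "good", "bad", "bad"]
model = NaiveBayesText().fit(texts, labels)
```

With this toy training set, `model.predict("great value")` returns "good" and `model.predict("awful product broke")` returns "bad", because the words in each phrase were seen more often under the matching label.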
Machine learning differs from traditional statistical methods in two ways: (1) it is a robust brute-force application of statistics; and/or (2) it uses algorithms unique to computational analysis that enable software to discover and learn things, much like humans would if we had unlimited or very high cognitive function, recall, and perfect vision. To that end, machine learning is an evolution beyond statistics and, essentially, is a tool that augments humans so we can discover and predict things we otherwise could not.
The All-important Research Questions
Oversimplified, and varying based on the domain application, the key question in forging a text analytics strategy and approach essentially is: “of the dozens of machine learning algorithms available, with hundreds of variations, which algorithms should be used and in what sequence or cycle(s)?” It is here that data science and machine learning generalists earn their keep many times over because they often have a broader, cross-disciplinary knowledge base of dozens of algorithms, when and where to use each, and the strengths and weaknesses of different combinations of them, called ensembles. From this foundation, the strategy informs the selection of initial and secondary approaches, the hiring of a team, the timeline, and the budget, and it determines the feasibility of what is possible and when.
The second key research question relies upon subject-matter or domain expertise: what is one trying to discover or predict, and from what? These specific goals and objectives usually come from team members who are well-versed in the application domain and who will make themselves available to answer (and re-answer) questions, conduct root-cause analysis, interpret draft findings, and review training and test data outcomes.
Factors — Key Considerations
Arguably, the answer to the research question (which algorithms to use when, why, and where) is informed by the answers to five prerequisite functional questions or considerations: (1) whether to use ready-made text analytics software, custom build, or use a hybrid; (2) the availability, accuracy, diversity, and labeling of text sources to serve as training data; (3) whether to use supervised or unsupervised machine learning, often determined by circumstances; (4) the lexical coverage and vocabulary needs in the domain application; and (5) the timeline and budget.
Alternatively, the plethora of approaches and algorithms can also be narrowed by classic “what do you want to do” questions about the project (see the above decision-tree figure); however, correctly answering these questions often requires some knowledge of text analytics and data sources that decision makers may not have. Therefore, most of this article focuses on the first set of prerequisite questions, which are more readily answerable.
1. Commercial off-the-shelf, Custom-built, or Hybrid Software
There are at least 83 commercial off-the-shelf (COTS) software solutions for use in text analytics that have enough customers to give them reviews and comparative rankings. Many have slightly different applications and use cases; however, they are essentially electronic cousins. None consistently ranks above the 95th percentile, although DiscoverText is close.
A secondary analysis published here for the first time of aggregate scores of editor and user reviews comparatively ranks the leaders in this field (see below Figure and Table). Knowing who the leaders are to consider is just as important as knowing who the laggards are to exclude from consideration. Generally, fee-based solutions consistently outperform free solutions by a significant margin. The names of providers who scored average have been culled in the visualization to allow room to easily see the leaders, the laggards, and where solutions from major companies rank (e.g., Apache, Google, IBM, Microsoft, Oracle, etc.). Solutions in the upper right had the highest aggregate scores; solutions on the lower left had the lowest aggregate scores.
For the vast majority of applications, one or more of these ten applications is sufficient and represents the best value of time and resources to get very good, albeit far from perfect, results on major steps in the text analytics process. In many cases, choosing a larger company’s product (e.g., IBM, Oracle, etc.), even if it is ranked lower than some others, can be beneficial because larger companies have broader and more dependable support and documentation than smaller companies or startups with higher ranked products.
If there is a business case to build a custom system to achieve superior performance, or because of idiosyncrasies in the domain application, then two approaches or combinations of algorithms have repeatedly performed well for classification and sentiment analysis, two of the largest steps of text analytics: (1) support vector machines (SVM) for supervised learning; and (2) combinations of recurrent neural networks (RNNs), convolutional neural networks (CNNs or ConvNets), and long short-term memory (LSTM) networks for unsupervised learning (Wang, 2016) (Yang, 2019). In almost all instances, a free or low-cost installation of Anaconda with RStudio will allow the thousands of R libraries already written to be reused in combination for most ensembles of algorithms. A list of algorithms traditionally used for text analytics can be seen in the following table.
2. Identifying & Preparing Voluminous, Accurate & Diverse Training Data
Selecting text data sources is half the work because training data, and its availability and usability, is a dirty secret of machine learning. Practitioners and surveys frequently state that 80% of the time and effort of data science is in data preparation, which for text analytics usually means preparing text sources (written or audio). To be effective, machine learning text analytics usually needs a large quantity of labeled training data to teach the algorithms to perform their prediction tasks. Inherent here is that the data be complete, clean, diverse, readily available in the future, and causal, not merely correlated. Moreover, there has to be enough causally associated, labeled, and clean data to run experiments on different models to determine their comparative effectiveness in predicting via train-test cycles.
For example, if one were trying to predict economic growth or contraction, one would need a large enough quantity and diversity of clean and labeled data to determine which of those data elements were causal (or telltales) as to when the economy was growing or contracting, by how much, and at what acceleration or deceleration. Arguably, this would best be done quickly and reasonably using random forest decision trees from Salford Systems, which could rank which elements mattered most and in what proportions. Once key causal terms are identified, sentiment analysis can infer meaning, and clustering can monitor shifts in sentiment.
In many cases, this prerequisite is a focus of substantial time and resources, and rightfully so. It may, or may not, be possible to predict economic expansion or contraction based on, for example, companies’ 10-K filings with the Securities and Exchange Commission (SEC), news articles, or statements by Federal Reserve Board members. It may take many data fishing expeditions to find the data elements that are most causal or predictive of the wanted outcome, after which training a machine learning system to monitor and predict shifts is only the second half of the job.
Once data sources have been identified, their features labeled, and causality confirmed, data cleansing and organization can be deftly handled by a dozen or so R functions in easily accessible and well-documented libraries.
Alternatively, text analytics can be used to identify and monitor trends. This is more of an analytical function that is faster and easier (read: cheaper) to do because it isn’t trying to predict outcomes, and the training is correspondingly simpler. The analysis is focused more on concept extraction than on causality and prediction. Even these sentiment trends, though, can be monitored over time to detect changing sentiments as clusters shift.
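As a rough illustration of what a train-test cycle looks like, the sketch below holds out part of a labeled data set, fits two deliberately simple models on the remainder, and compares their accuracy on the held-out rows. Everything here is an assumption made up for the example: the eight labeled snippets, the "majority" baseline, and the "keyword" model are stand-ins for whatever models and data a real project would compare.

```python
import random
from collections import Counter

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle labeled rows reproducibly and hold out a test set."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_fraction))
    return rows[:cut], rows[cut:]

def fit_majority(train):
    """Baseline: always predict the most common training label."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda text: majority

def fit_keyword(train):
    """Toy model: predict 'neg' if the text has a word seen only in 'neg' training texts."""
    words = {"pos": set(), "neg": set()}
    for text, label in train:
        words[label].update(text.lower().split())
    neg_only = words["neg"] - words["pos"]
    return lambda text: "neg" if neg_only & set(text.lower().split()) else "pos"

def accuracy(model, rows):
    """Share of held-out rows the model labels correctly."""
    return sum(model(text) == label for text, label in rows) / len(rows)

def compare(rows, fitters, seed=0):
    """Fit each candidate on the training rows and score it on the same test rows."""
    train, test = train_test_split(rows, seed=seed)
    return {name: accuracy(fit(train), test) for name, fit in fitters.items()}

# Hypothetical labeled snippets, invented for illustration.
LABELED = [
    ("fast shipping great quality", "pos"),
    ("great quality love it", "pos"),
    ("love the fast shipping", "pos"),
    ("great value love it", "pos"),
    ("broken item want refund", "neg"),
    ("want refund now", "neg"),
    ("item arrived broken refund please", "neg"),
    ("refund requested item broken", "neg"),
]
scores = compare(LABELED, {"majority": fit_majority, "keyword": fit_keyword})
```

The point of the design is that every candidate is judged on the same rows it never saw during training; with so little data the numbers themselves mean little, which is exactly why large, clean, labeled data sets matter.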
3. Choosing a Supervised or Unsupervised Machine Learning Approach
The decision to use supervised or unsupervised (more automatic) machine learning for text analytics largely depends on which step in the process is being performed, how much text needs to be analyzed how often, and how perfect it needs to be (e.g., sensitivity versus specificity). Smaller datasets that can be turned around more slowly, or with strategic timelines, may lend themselves more to supervised machine learning approaches. Larger data sets that need to provide streams of recommendations or predictions, regardless of timelines, lend themselves to unsupervised machine learning approaches, provided they can deliver the desired degree of accuracy, sensitivity, and specificity.
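Because sensitivity and specificity are easy to confuse, a short worked sketch may help. Sensitivity is the share of true positives the model catches; specificity is the share of true negatives it correctly leaves alone. The spam/ham labels and the ten example predictions below are hypothetical, chosen only to make the arithmetic visible.

```python
def confusion_counts(actual, predicted, positive):
    """Count true/false positives and negatives for one 'positive' label."""
    tp = fp = tn = fn = 0
    for a, p in zip(actual, predicted):
        if p == positive:
            if a == positive:
                tp += 1   # flagged, and really positive
            else:
                fp += 1   # flagged, but actually negative
        else:
            if a == positive:
                fn += 1   # missed a real positive
            else:
                tn += 1   # correctly left alone
    return tp, fp, tn, fn

def sensitivity(tp, fn):
    """Share of actual positives that were caught (also called recall)."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """Share of actual negatives that were correctly not flagged."""
    return tn / (tn + fp)

# Hypothetical results: 4 spam and 6 legitimate ("ham") messages.
actual    = ["spam"] * 4 + ["ham"] * 6
predicted = ["spam", "spam", "spam", "ham",
             "ham", "ham", "ham", "ham", "spam", "ham"]

tp, fp, tn, fn = confusion_counts(actual, predicted, positive="spam")
print(sensitivity(tp, fn))  # 3 of 4 spam messages caught: 0.75
print(specificity(tn, fp))  # 5 of 6 legitimate messages left alone
```

Which of the two matters more depends on the application: a spam filter that loses legitimate mail needs high specificity, while a fraud or safety screen usually favors high sensitivity.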
4. Lexical Coverage
One of the key steps in applying or adapting text analytics to different domains is feature extraction, the preparation needed to infer sentiment. Historically, this was done with n-grams: text was broken into tokens (single words or short word sequences) that could be represented in binary (zeros and ones). Newer methods using transfer learning, which reuses elements from prior models, are now both faster and far more accurate. For a lexicon to predict and classify sentiment, the fastai library, which provides a language model pre-trained on Wikipedia text, is an advanced starting point for building a customized language model for specific domain applications. A specific domain adaptation can be built atop the pre-trained language model, followed by a classifier, for maximum development and processing speed and accuracy (95% accuracy is possible). There are often other domain adaptations that must be taken into consideration; however, lexicon and vocabulary are universally major considerations.
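The historical n-gram approach can be sketched in a few lines. The example below is a minimal illustration, assuming whitespace tokenization and two made-up sentences: it extracts two-word sequences (bigrams) and marks each text with a one or a zero for every bigram in the combined vocabulary.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def binary_vectors(texts, n=2):
    """Encode each text as 1/0 flags over the sorted vocabulary of n-grams."""
    grams_per_text = [set(ngrams(t.lower().split(), n)) for t in texts]
    vocab = sorted(set().union(*grams_per_text))
    vectors = [[1 if gram in grams else 0 for gram in vocab]
               for grams in grams_per_text]
    return vocab, vectors

# Hypothetical sentences, invented for illustration.
vocab, vectors = binary_vectors(["not good at all", "very good at times"])
print(vocab)    # ['at all', 'at times', 'good at', 'not good', 'very good']
print(vectors)  # [[1, 0, 1, 1, 0], [0, 1, 1, 0, 1]]
```

Note that the bigrams keep "not good" and "very good" apart, which single-word tokens would not. The trade-off is that the vocabulary grows quickly and every new domain needs its own, which is one reason pre-trained language models have become the more common starting point.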
5. Budget, Resources & Timelines
There is an aphorism that requests can be fulfilled cheaply, perfectly, or quickly, and you only ever get to pick two of those requirements because all three at once are impossible. My 20 years of experience applying predictive analytics and data science have borne that out.
Money, talent, and time restrict most projects, including those in data science, and it is a positive thing that they do; otherwise spending would be endless and a positive return on the investment would be highly improbable. Decisions regarding which tools to use often depend on the tools available. Similarly, they may depend on the talent a project has access to and for how long. A classic example in data science is whether to use the R statistical language or the general-purpose Python language. While there are advantages and disadvantages to both (I prefer R), the choice may depend on which languages your teammates know.
Regarding timelines, supervised text analytics is far more labor intensive (and thus could take longer); however, custom coding, feature extraction, and tuning of unsupervised machine learning algorithms, especially for predictions, can also be time intensive. A larger team probably lends itself to supervised approaches, and a smaller team (with more data) probably lends itself to unsupervised approaches. The discussion of value below suggests how to balance these considerations when starting a new text analytics project, which often aims for quick wins to gain credibility and more funding for future work.
Advanced Issues — Predicting Sentiment Trends & Explainability
If there is a dirty secret of data science other than training times and efforts, it’s explainability. Most of the consumers and decision makers employing machine learning are executives without a specialized background in data science, or even statistics. Even if they do have such backgrounds, machine learning is far enough from their domain expertise, and moves quickly enough, that its value is recognizable but not known in detail. Therefore, for these decision makers to use the predictive models they must trust them, and to trust them, they must understand them. Hence, explainability becomes a key consideration.
One way to maximize explainability is to use visualizations for feature analysis. One favored visualization, which is an unsupervised machine learning technique in and of itself, is the self-organizing map (SOM). SOMs classify data into clustered segments based on similar traits. In ecommerce, for example, they cluster customers into groups such as those with high spending but low frequency, or high frequency but low spending, which informs how each group can be targeted with marketing or behavioral interventions.
In sentiment analysis, the positivity or negativity and strength or weakness of sentiments can also be clustered using SOMs. If this analysis is repeated in different time frames, it is also possible to see how sentiment is shifting, in what direction, and at what speed or rate. From these shifting clusters one can calculate probabilities of future directions with confidence intervals. This necessitates a cyclical approach of ingest, analyze, repeat, and adds a clustering step (of words, for example) after concept extraction.
For example, imagine seeing a shift towards economic contraction, registered by word usage in corporate 10-K filings, that is increasing at an increasing rate. One could theoretically calculate the probability of recession, or unemployment claims, or borrowing or savings rates, with reasonable accuracy. The ability to predict with probability where an economy is heading allows for preventive interventions to maximize its outcomes.
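"Increasing at an increasing rate" has a simple numeric reading: the period-over-period change is positive, and so is the change in that change. The sketch below illustrates the idea under loud assumptions: the four filing excerpts are invented, the four-word contraction lexicon is hypothetical, and a real system would work from clustered sentiment rather than raw word counts.

```python
import re

# Hypothetical lexicon of contraction-related terms, for illustration only.
CONTRACTION_TERMS = {"layoffs", "downturn", "impairment", "restructuring"}

def term_share(text, lexicon):
    """Fraction of a text's words that belong to the lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(t in lexicon for t in tokens) / len(tokens) if tokens else 0.0

def differences(series):
    """Change from each period to the next."""
    return [later - earlier for earlier, later in zip(series, series[1:])]

# Invented excerpts standing in for four consecutive reporting periods.
filings = [
    "Revenue grew and margins improved across all of our segments.",
    "Revenue grew, but we recorded one impairment in a segment.",
    "A downturn led to restructuring and impairment in two segments.",
    "The downturn forced layoffs, restructuring, impairment, more layoffs, and restructuring.",
]

shares = [term_share(f, CONTRACTION_TERMS) for f in filings]
velocity = differences(shares)        # how fast the share is rising
acceleration = differences(velocity)  # whether the rise itself is speeding up

# Both tests must pass for the shift to be rising at an increasing rate.
rising_faster = all(v > 0 for v in velocity) and all(a > 0 for a in acceleration)
```

Here the share of contraction terms climbs from 0% to 10%, 30%, and 60%, so both the changes and the changes-in-changes are positive. Turning such a signal into a probability of recession would still require the labeled history and model comparison discussed earlier.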
These economic predictions can also be made based on consumer or corporate sentiment in text-based Big Data for different industries, geographies, or both and assembled into multi-dimensional databases for visualizations. This would allow organizations to more accurately predict which sectors or regions are cooling and which are warming at different times to different degrees, or what issues are trending where, and sometimes, why.
Key here is the notion of predictiveness. Historically, significant latency was introduced between when observations were made, recorded, and analyzed and when new decisions were taken. These tools allow that latency to be largely reduced and effort shifted into preventative policy or management, which also applies to discovering trends, not just causal predictions.
Ultimately, explainability is such a critical issue that experts often recommend an understandable and explainable approach or model over one that might perform slightly better but is inexplicable.
In a perfect world, prototype projects can be built as proofs of concept that can also be operationalized into production quickly, delivering valuable new insights with investments measured in $10,000s instead of $100,000s. Once these prototypes are socialized and gain trust, the budget, timeline, and sophistication of text analytics can evolve, during which time substantially more text data almost always becomes available to better train and inform outcomes, predictions, and trend analysis. The possibilities of what can be done with text analytics grow every day because of the rapid growth of the corpus of text-based communications.
At the end of the day, value or return on investment is what makes most business cases regarding how much budget (effort, time, talent) to invest in text analytics. A three- to six-month project using COTS may be the most some organizations can spend and still have a positive ROI. Governments, multinational corporations, and healthcare organizations have business cases to deploy text analytics, and other forms of machine learning, in evolutionary waves: get a 60–75% predictor or a small-scale trend analyzer with COTS, then customize or expand (more geographic diversity, data inputs, specificity, precision, or whatever the objectives are).
Text analytics can offer extraordinary insights into public sentiment for economics, finance, ecommerce, and social and geopolitical issues. In large part, this is because human behaviors on social media and the digitization of communications are creating such a massive corpus of textual data to mine. If you lead an organization or business in which the capability to predict consumer or large population sentiment is valuable, or to analyze sentiment trends and how they shift over time in different cohorts, you probably have a business case to explore text analytics with real-world experiments. Thereafter, it’s probably best to begin with a generalist data science expert who can quickly prototype proofs of concept using COTS tools as much as possible to socialize and explore the interest and benefits, and to learn if, when, and how larger investments in teams and custom solutions are likely to have a positive return.
Brinegar, C. (1963). Mark Twain and the Quintus Curtius Snodgrass letters: A statistical test of authorship. Journal of the American Statistical Association, 58(301): 85–96.
Griffith, E. (2018, November 15). 90 percent of the big data we generate is an unstructured mess. Retrieved from PC Magazine: https://www.pcmag.com/news/364954/90-percent-of-the-big-data-we-generate-is-an-unstructured-me
Hobbs, J., Walker, D., Amsler, R. (1982). Natural language access to structured text. Proceedings of the 9th Conference on Computational Linguistics (COLING ’82) (pp. 127–132). Prague, Czechoslovakia: Academia Praha.
Imanuel. (2019, August 21). Top 63 software for text analysis, text mining, text analytics. Retrieved from Predictive Analytics Today: https://www.predictiveanalyticstoday.com/top-software-for-text-analysis-text-mining-text-analytics/
Madigan, D., Lewis, D. (2019, August 18). Text mining: An overview. New York City, New York, USA: Columbia University.
Mendenhall, T. (1887). The characteristic curves of composition. Science, 9(214S): 237–246.
Miner, G., Delen, D., Elder, J., Fast, A., Hill, T., Nisbet, R. (2012). The seven practice areas of text analytics. In Practical text mining and statistical analysis for non-structured text data applications (pp. 29–41). Amsterdam: Elsevier, Inc.
Mosteller, F., Wallace, D. (1963). Inference in an authorship problem. Journal of the American Statistical Association, 58(302): 275–309.
Wang, J., Yu, L.-C., Lai, K., Zhang, X. (2016). Dimensional sentiment analysis using a regional CNN-LSTM model. Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (pp. 225–231). Berlin: Association for Computational Linguistics.
Yang, F., Du, C., Huang, L. (2019). Ensemble sentiment analysis method based on R-CNN and C-RNN with fusion gate. International Journal of Computers Communications & Control, 272–285.