As economic and social transactions are increasingly digitized and user-generated content grows rapidly, unprecedented volumes of textual data are being produced, such as customer reviews on e-commerce platforms and discussions in online communities. Extracting real insights from such data, however, requires effective computational methods that leverage the power of big data analytics. Rising to this challenge, HKUST’s Yi Yang and colleagues propose a novel approach to the most widely used of these methods: topic modeling.
“Topic modeling has become the dominant tool used by information systems (IS) and business discipline researchers to explore and extract insights from textual data,” note the authors. Topic modeling methods are “powerful tools for analyzing massive amounts of textual data.” However, they are not without limitations. “Traditional statistical topic models are unsupervised and are trained using only textual data,” say the authors. This is a problem because many real-world platforms also include useful auxiliary metadata, such as the numerical ratings that accompany textual reviews on e-commerce platforms.
Unsupervised topic models ignore such data. As a result, the authors say, “the identified topics and variables derived based on the learned topic model may not be accurate, which could lead to incorrect estimations that affect subsequent empirical analysis and to inferior performance on predictive tasks.”
Such auxiliary metadata are challenging to include using Bayesian-based topic modeling approaches, the authors point out. Doing so requires the data generation process to be explicitly defined, which “limits the generalizability of the resulting model with regard to capturing complex semantic relationships in text collection.” This may result in poor model fit.
To overcome these challenges, the authors propose “a novel deep-learning-based neural topic modeling method that leverages the useful auxiliary metadata associated with the text.” This innovative approach, known as supervised deep topic modeling (sDTM), combines unsupervised neural topic modeling and supervised recurrent neural networks in a single framework, thus achieving the benefits of both supervised deep learning and unsupervised topic modeling.
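The idea of combining both kinds of learning in a single framework can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the combined objective can be written as an unsupervised reconstruction loss plus a weighted supervised prediction loss, and it swaps the recurrent neural network for a plain linear predictor on the topic proportions to stay short. All names here (`joint_loss`, `sup_weight`, the toy topic-word matrix and rating) are hypothetical.

```python
import math


def softmax(scores):
    """Turn raw topic scores into topic proportions that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def reconstruction_loss(word_counts, topic_props, topic_word):
    """Unsupervised part: negative log-likelihood of a bag of words
    under the document's mixture of topic-word distributions."""
    loss = 0.0
    for w, count in enumerate(word_counts):
        p_w = sum(topic_props[k] * topic_word[k][w]
                  for k in range(len(topic_props)))
        loss -= count * math.log(p_w)
    return loss


def supervised_loss(topic_props, weights, bias, label):
    """Supervised part: squared error of a linear predictor on the topic
    proportions (an assumed stand-in for the RNN named in the text)."""
    pred = bias + sum(w * t for w, t in zip(weights, topic_props))
    return (pred - label) ** 2


def joint_loss(word_counts, topic_scores, topic_word,
               weights, bias, label, sup_weight=1.0):
    """Assumed joint objective: reconstruction + sup_weight * supervised."""
    topic_props = softmax(topic_scores)
    return (reconstruction_loss(word_counts, topic_props, topic_word)
            + sup_weight * supervised_loss(topic_props, weights, bias, label))


# Toy data (hypothetical): 2 topics over a 3-word vocabulary, one review
# with word counts and a numerical rating as auxiliary metadata.
topic_word = [[0.7, 0.2, 0.1],
              [0.1, 0.2, 0.7]]
word_counts = [3, 1, 0]
topic_scores = [0.0, 0.0]
weights, bias, rating = [5.0, 1.0], 0.0, 4.0

text_only = joint_loss(word_counts, topic_scores, topic_word,
                       weights, bias, rating, sup_weight=0.0)
with_rating = joint_loss(word_counts, topic_scores, topic_word,
                         weights, bias, rating, sup_weight=1.0)
print(text_only, with_rating)
```

Setting `sup_weight` to zero recovers a purely text-driven objective, while a positive value lets the rating influence which topic proportions are preferred. In a trained model, both terms would be minimized together over the model parameters.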
The researchers then demonstrate the effectiveness of sDTM for high-quality topic modeling by conducting empirical case studies and predictive analytics using online consumer review data and online knowledge community data. “Not only can the approach alleviate concerns about measurement errors caused by using inaccurate machine learning models in empirical analysis,” they say, “but our supervised model greatly improves prediction accuracy in predictive analytics.”
This novel approach will prove useful not only to information systems researchers but also to researchers in other business disciplines, such as finance and accounting. For example, the researchers say, sDTM can effectively analyze topical information in corporate disclosures such as 10-K filings and earnings conference call transcripts, which are often extremely lengthy.
sDTM may also benefit practitioners, the authors note. It is a “plug-and-play” tool that requires no significant changes to existing text analysis pipelines. The authors have also made sDTM open-source to maximize its reach and accessibility. “We hope our development of a high-quality supervised deep topic modeling approach can help IS and business discipline scholars and business practitioners to harness the power of machine learning methods in understanding and gaining insights from unstructured text data,” they conclude.