Today’s organizations create and manage more digital content than ever before. Much of it is stored as documents, spreadsheets, text, and media files that frequently contain personal, financial, or healthcare data. Embedded in the text are often critical terms and key phrases that identify the themes, subjects, and significance of a document. Unfortunately, the immense volume and velocity of such data makes it difficult to separate relevant, high-value information from noisy, low-quality data. The inability to turn data into knowledge can have a crippling effect on human decision making, often resulting in poor judgement and a misunderstanding of risk.
Sorting through data across an enterprise can be an expensive and cumbersome process. Few organizations are equipped to handle data classification by traditional (manual) methods. Automation tools can help streamline the process, but an enterprise must determine the categories and criteria that will be used to classify data and clearly define the objectives. Using Cognitive AI to analyze and classify data at scale can significantly improve knowledge worker productivity and decision making by automatically discovering useful patterns, trends or data relationships based on semantic meaning and context.
Thematic Analysis focuses on collecting and analyzing text, document, image or audio data to understand concepts, opinions, or experiences. This is sometimes called Classification, Entity Extraction or Topic Modeling, although in practice these are separate functions (services) that make use of different algorithms for quantitative analysis. Classification relies on data clustering techniques that can automatically identify key terms or phrases (features) that describe a document or text. Classified texts may then be used as data dictionaries or labeled sets to train other analytic services. Thematic Analysis services make use of "shallow" neural networks (often called word vectors) and data grouping techniques to perform targeted text analysis that can discover similar terms, related phrases (n-grams) and semantic relationships between words.
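To make the clustering idea concrete, here is a minimal sketch (plain Python, not StreamScape's API) that represents documents as bag-of-words term vectors and compares them by cosine similarity. All names are illustrative; real services use learned word vectors rather than raw term counts, but the intuition is the same: documents that share a theme score higher than unrelated ones.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Lowercase and split on non-letters; a crude stand-in for real tokenization
    return re.findall(r"[a-z]+", text.lower())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two term-frequency vectors
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

docs = [
    "Vaccine trials measure efficacy and reduced symptoms.",
    "The vaccine study reported strong efficacy in trials.",
    "Cancer treatment aims to extend patient life.",
]
vectors = [Counter(tokenize(d)) for d in docs]

# The two vaccine documents are more similar to each other
# than either is to the cancer document
print(cosine(vectors[0], vectors[1]) > cosine(vectors[0], vectors[2]))
```

A clustering service effectively applies this kind of pairwise similarity at scale, grouping documents whose vectors fall close together.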
Thematic Analysis is often used to gain deeper insight into a problem, identify themes (topics), synonyms, antonyms and key phrases, or to generate new ideas for research. Identifying these feature sets allows machines to work with semantic terms rather than strict data values. By understanding human semantics, Cognitive AI can discover meaning and context by analyzing large volumes of related text, and build Semantic Graphs (semagraphs) representing new knowledge. Such specialized knowledge graphs may be generated by a cognitive automation process without human interaction. Re-computing of the semagraphs may also be triggered by changes in critical information or the arrival of new data; a process known as unsupervised learning.
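One simple way to picture a semagraph is as a graph whose edges record how often two terms appear together. The sketch below (illustrative only; a production service would use far richer signals than sentence co-occurrence) builds such a graph automatically from raw sentences, with no human labeling, and can be re-run whenever new data arrives:

```python
from collections import defaultdict
from itertools import combinations

def build_semagraph(sentences, stopwords=frozenset({"the", "a", "of", "and", "to"})):
    # Edge weight = number of sentences in which two terms co-occur;
    # a crude proxy for the semantic relationships a real service would learn
    graph = defaultdict(lambda: defaultdict(int))
    for sentence in sentences:
        terms = sorted({w for w in sentence.lower().split() if w not in stopwords})
        for u, v in combinations(terms, 2):
            graph[u][v] += 1
            graph[v][u] += 1
    return graph

corpus = [
    "vaccine trials reduce symptoms",
    "vaccine efficacy trials",
    "cancer treatment extends life",
]
g = build_semagraph(corpus)
print(g["vaccine"]["trials"])  # "vaccine" and "trials" co-occur in 2 sentences
```

Because the graph is derived purely from the data, appending new sentences and rebuilding it is the sketch-level analogue of the re-computation triggered by arriving data.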
If you can't identify it, you can't measure it! The success of any Machine Learning algorithm depends on our ability to identify and analyze meaningful data points or measurements, often referred to as a model's features. Features describe critical things about data that best characterize a problem you are trying to solve or a question you are trying to answer. Errors in Predictive Analytics typically occur because we pick the wrong things to measure.
Cognitive AI services train their models by evaluating a collection of documents or text that identifies a subject matter or a body of related information. Such training sets must first be prepared for analysis and then "cleaned" to eliminate irrelevant parts of speech that do not convey any valuable, identifying information. This is generally referred to as Tokenization. Specialized Cognitive AI technologies such as StreamScape's Data Fabric typically provide a number of additional capabilities, such as creation and inclusion of Data Dictionaries, injection of ontology tags such as Synonym, Hyponym or Meronym, automatic conversion of complex terms into their root elements (Lemmatization) and dynamic generation of N-Grams (key phrases) that help identify content.
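The preparation steps above can be sketched in a few lines of plain Python. This is a toy pipeline, not StreamScape's implementation: the stopword list is a stand-in for cleaning, the suffix-stripper is a deliberately crude stand-in for dictionary-based lemmatization, and the n-gram function generates sliding-window key phrases:

```python
import re

# Toy stopword list; real cleaning uses part-of-speech information
STOPWORDS = {"the", "is", "a", "an", "of", "and", "to", "in"}

def clean_tokens(text):
    # Lowercase, strip punctuation, drop stopwords that carry no identifying signal
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

def crude_lemma(token):
    # Naive suffix stripping as a stand-in for true lemmatization
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def ngrams(tokens, n=2):
    # Sliding-window key phrases (bigrams by default)
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = [crude_lemma(t) for t in clean_tokens("The trials reduced symptoms in patients.")]
print(tokens)
print(ngrams(tokens))
```

Note that the crude stemmer maps "reduced" to "reduc" rather than "reduce"; this is exactly the kind of error that dictionary-backed lemmatization avoids.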
The process of identifying and fine-tuning the terms and phrases that categorize data is called Feature Engineering. Data scientists spend most of their time on feature engineering, which can be a slow and labor-intensive process. Training data is often gathered from disparate sources and may require significant preparation before it can be used. But identifying the right features makes a big difference in model accuracy. Feature Engineering is about finding out what measurements are meaningful for predicting outcomes and prescribing actions for decision makers to take.
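As a small illustration of "finding out what measurements are meaningful", the sketch below ranks candidate terms by how strongly they discriminate between two classes of documents. The scoring rule (absolute frequency difference) is a hypothetical toy; practical feature selection uses measures like TF-IDF or information gain, but the goal is the same:

```python
from collections import Counter

def discriminative_terms(class_a_docs, class_b_docs, top=3):
    # Rank terms by absolute frequency difference between two classes:
    # a toy proxy for selecting features that actually separate outcomes
    freq_a = Counter(w for d in class_a_docs for w in d.lower().split())
    freq_b = Counter(w for d in class_b_docs for w in d.lower().split())
    terms = set(freq_a) | set(freq_b)
    ranked = sorted(terms, key=lambda t: abs(freq_a[t] - freq_b[t]), reverse=True)
    return ranked[:top]

cancer_docs = ["cancer treatment extends life", "new cancer treatment trial"]
vaccine_docs = ["vaccine reduces symptoms", "vaccine efficacy trial results"]

# "trial" appears equally in both classes, so it carries no signal;
# "cancer", "treatment" and "vaccine" rank highest
print(discriminative_terms(cancer_docs, vaccine_docs))
```

Notice that a word like "trial", common to both domains, scores zero: picking it as a feature would be exactly the kind of wrong-thing-to-measure error described above.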
Synonyms, Hyponyms, Meronyms
Given the highly specialized nature of thematic analysis, it may be desirable to create knowledge graphs that are domain-specific. For example, using medical research documents specific to cancer or vaccine efficacy trials can produce very specialized semagraphs, linked to a focused area of research, that understand the vocabulary of a given domain.
In cancer research, for instance, the term "life saving" means something very different from what it means in vaccine research. The primary goal of cancer treatment at this time is to extend human life by some significant measure, whereas the goal of vaccines is to reduce symptoms and eliminate fatal outcomes. Yet the machine can be trained to understand such subtle, contextual differences quite easily. It can also be trained to sort documents and texts related to vaccines from those that discuss cancer research, compare such documents to similar texts, and even identify documents with overlapping interests, such as theoretical works that discuss cancer vaccines. How is this done?
Thematic Analysis and Classification services make use of Data Dictionaries that understand Parts of Speech, Root Words and Functional Taxonomies of terms. For example, a hyponym refers to a type, while a meronym refers to a part: a hyponym of tree is pine tree or oak tree (a type of tree), but a meronym of tree is bark or leaf (a part of a tree). Likewise, the dictionary may tag a term as a synonym or antonym. This kind of functional taxonomy of terms (semantics) makes it easy for machines to filter or match on specific types of terms, and allows the user to coerce features into a specific category, so that very detailed, contextual terms or phrases can be associated with that category. Such taxonomies are often created by Data Scientists during the Feature Engineering phase of model training. The features can then be used to narrow down a subject or document during search, build knowledge graphs that relate content by very specific semantic meaning, or predict terms that should appear in data for it to fit a specific category. In Cognitive AI parlance we call this Semantic Coercion: the ability to inject synonyms, meronyms, antonyms or hyponyms into the model training process, allowing fine-grained Domain-Specific Meaning to emerge during Classification.
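A data dictionary with taxonomy tags can be pictured as a simple lookup structure. The sketch below is a hand-built, hypothetical dictionary (not StreamScape's format) using the tree example from above; filtering by tag is the kind of targeted match that lets a service coerce features into a category:

```python
# Hypothetical data dictionary: each term is tagged with its functional
# taxonomy relative to a head word (Hyponym = a type, Meronym = a part)
DICTIONARY = {
    "tree": {
        "pine tree": "hyponym",  # a type of tree
        "oak tree":  "hyponym",  # a type of tree
        "bark":      "meronym",  # a part of a tree
        "leaf":      "meronym",  # a part of a tree
    },
}

def terms_by_relation(head, relation):
    # Filter the dictionary by taxonomy tag; the kind of match a
    # classification service uses to coerce features into a category
    return sorted(t for t, r in DICTIONARY.get(head, {}).items() if r == relation)

print(terms_by_relation("tree", "hyponym"))  # types of tree
print(terms_by_relation("tree", "meronym"))  # parts of a tree
```

A data scientist extending such a dictionary during feature engineering is, in effect, performing semantic coercion by hand: every tagged term becomes a feature with an explicit, domain-specific meaning.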
The Future of Thematic Analytics
The field of data classification, semantic and thematic analysis is an emerging discipline with many practical applications. It underpins technologies such as Chat Bots and plays a critical role in Generative AI, allowing GPT engines to match a user's prompts to specific texts in a corpus used to generate responses, thereby creating the critical illusion of subject knowledge. Understanding themes and semantics is increasingly important given the enormous amount of text and documents being generated by modern systems. Research indicates that about 30% of all documents read by specialists are mis-categorized; in some areas the figure is as high as 40% when industry-specific jargon is used. Contextual analysis performed by Cognitive AI can greatly reduce or eliminate the manual effort of document review and categorization, providing more accurate results than a typical human reviewer can. Techniques like Semantic Coercion can inject domain-specific information into the model, improving the factual accuracy of information recall.
Cognitive AI services make use of Thematic Analysis and Classification to discover, organize, and retrieve information based on similar content, unique features, or semantic meaning. Data classification can also be used to establish intent or sentiment in a body of text and predict features that are unknown or may appear in the future. Classification and Thematic Analysis have broad application in decision support systems, legal due diligence, direct marketing, regulatory compliance, data privacy, insurance fraud detection and medical diagnosis.
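Pulling the pieces together, classification itself can be sketched as a scorer over engineered features. The weights below are hypothetical and hand-assigned; in a trained model they would be learned from labeled data, but the mechanism of matching domain-specific terms to categories is the same:

```python
# Hypothetical keyword-weighted scorer standing in for a trained classifier:
# each category's feature terms (and weights) would come from feature engineering
FEATURES = {
    "vaccine":  {"vaccine": 2.0, "efficacy": 1.5, "symptoms": 1.0, "trial": 1.0},
    "oncology": {"cancer": 2.0, "tumor": 1.5, "treatment": 1.0, "trial": 0.5},
}

def classify(text):
    # Score each category by summing the weights of its feature terms
    # that appear in the text, then pick the highest-scoring label
    words = text.lower().split()
    scores = {
        label: sum(weight for term, weight in feats.items() if term in words)
        for label, feats in FEATURES.items()
    }
    return max(scores, key=scores.get), scores

label, scores = classify("the trial showed strong vaccine efficacy")
print(label)  # "vaccine"
```

Note how the shared term "trial" carries different weights per category, a miniature version of the contextual, domain-specific meaning discussed earlier.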
Find out more about this capability and StreamScape's available tool set at https://www.streamscape.com/product/use-cases/classification.