Sensitive info masking in the age of AI // semantic analysis

Sensitive information encompasses more than just names, phone numbers, addresses, and IDs.

sbagency
4 min read · Apr 24, 2024

Masking sensitive information is not just about replacing entities and snippets; it also requires semantic analysis of what the text reveals in context. This is now possible thanks to LLMs.
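For example, an LLM can be prompted to mask not only explicit identifiers but also details that identify someone indirectly. A minimal sketch, assuming the OpenAI Python client and a chat model (any instruction-following LLM would do):

```python
# Minimal sketch of semantic masking with an LLM (assumes the OpenAI Python
# client and an instruction-following chat model; any comparable LLM works).
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

MASKING_PROMPT = (
    "Rewrite the text so that no sensitive information remains. "
    "Replace names, contacts, IDs, and addresses with placeholders like [NAME], "
    "and also remove or generalize details that only indirectly identify a "
    "person (employer + job title + city, rare medical conditions, etc.). "
    "Keep everything else unchanged."
)

def mask_semantically(text: str, model: str = "gpt-4o-mini") -> str:
    """Ask the LLM to mask both explicit entities and contextual identifiers."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": MASKING_PROMPT},
            {"role": "user", "content": text},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(mask_semantically(
    "John Smith, the only cardiologist at Mercy Hospital in Springfield, "
    "can be reached at john.smith@mercy.org."
))
```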

Simple entity masking

https://www.holisticai.com/blog/managing-personal-data-in-large-language-models
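The simplest baseline is pattern matching for obvious identifiers; it catches formats, not meaning:

```python
# Pattern-based masking of obvious identifiers (emails, phone numbers).
# A deliberately simple sketch; real PII patterns need locale-aware rules.
import re

PATTERNS = {
    "[EMAIL]": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "[PHONE]": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def mask_patterns(text: str) -> str:
    for placeholder, pattern in PATTERNS.items():
        text = pattern.sub(placeholder, text)
    return text

print(mask_patterns("Call Jane at +1 (555) 123-4567 or jane.doe@example.com"))
# -> "Call Jane at [PHONE] or [EMAIL]"  (note: the name is NOT caught)
```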

Simple entity recognizer

https://spacy.io/api/entityrecognizer
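A step up: mask whatever an off-the-shelf statistical NER model finds. A sketch using spaCy's small English pipeline (assumes en_core_web_sm has been downloaded):

```python
# NER-based masking with spaCy's off-the-shelf entity recognizer
# (install the model first: python -m spacy download en_core_web_sm).
import spacy

nlp = spacy.load("en_core_web_sm")

def mask_entities(text: str, labels: set[str] = {"PERSON", "ORG", "GPE", "DATE"}) -> str:
    doc = nlp(text)
    out = text
    # Replace from the end so earlier character offsets stay valid.
    for ent in reversed(doc.ents):
        if ent.label_ in labels:
            out = out[:ent.start_char] + f"[{ent.label_}]" + out[ent.end_char:]
    return out

print(mask_entities("Alice Johnson joined Acme Corp in Berlin on 3 May 2023."))
# e.g. "[PERSON] joined [ORG] in [GPE] on [DATE]."
```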

More complex NER

https://universal-ner.github.io/

We propose a general recipe for targeted distilling where we train student models using mission-focused instruction tuning for a broad application class such as open information extraction. We show that this can maximally replicate LLM’s capabilities for the given application class, while preserving its generalizability across semantic types and domains. Using NER as a case study, we successfully distill these capabilities from LLMs into a much smaller model, UniversalNER, which can recognize diverse types of entities or concepts in text corpora from a wide range of domains. UniversalNER surpasses existing instruction-tuned models at the same size (e.g., Alpaca, Vicuna) by a large margin, and shows substantially better performance than ChatGPT.
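A rough sketch of querying a distilled UniversalNER checkpoint via Hugging Face transformers; the model name and the conversation template below follow the project's published materials but should be verified against the repo:

```python
# Sketch: one-entity-type-at-a-time extraction with a UniversalNER checkpoint.
# Model name and prompt template are assumptions to check against the project repo.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Universal-NER/UniNER-7B-type"  # assumed published checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, device_map="auto")  # needs accelerate

def extract(text: str, entity_type: str) -> str:
    prompt = (
        "A virtual assistant answers questions from a user based on the "
        f"provided text.\nUSER: Text: {text}\nASSISTANT: I've read this text.\n"
        f"USER: What describes {entity_type} in the text?\nASSISTANT:"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=128)
    # Decode only the newly generated tokens (the answer).
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

print(extract("Maria Lopez transferred $2,000 to account DE89 3704 0044 0532 0130 00.", "person"))
```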

https://docs.deeppavlov.ai/en/master/features/models/NER.html

Named Entity Recognition (NER) is the task of assigning a tag (from a predefined set of tags) to each token in a given sequence. In other words, the NER task consists of identifying named entities in the text and classifying them into types (e.g., person name, organization, location, etc.).
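A sketch of token-level masking on top of a pretrained DeepPavlov NER pipeline; the exact config name and output format are assumptions to check against the docs:

```python
# Token-level masking with a DeepPavlov pretrained NER pipeline.
# Config name and output shape are assumptions; see the DeepPavlov NER docs.
from deeppavlov import build_model

ner = build_model("ner_ontonotes_bert", download=True)

tokens, tags = ner(["Bob Ross lived in Daytona Beach, Florida."])
# tokens[0] -> ["Bob", "Ross", "lived", ...]
# tags[0]   -> ["B-PERSON", "I-PERSON", "O", ...]  (BIO scheme)

masked = " ".join(
    f"[{tag[2:]}]" if tag.startswith("B-") else tok
    for tok, tag in zip(tokens[0], tags[0])
    if not tag.startswith("I-")  # drop continuation tokens of a masked span
)
print(masked)  # e.g. "[PERSON] lived in [GPE] , [GPE] ."
```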

https://arxiv.org/pdf/2305.15444.pdf

In a surprising turn, Large Language Models (LLMs) together with a growing arsenal of prompt-based heuristics now offer powerful off-the-shelf approaches providing few-shot solutions to myriad classic NLP problems. However, despite promising early results, these LLM-based few-shot methods remain far from the state of the art in Named Entity Recognition (NER), where prevailing methods include learning representations via end-to-end structural understanding and fine-tuning on standard labeled corpora. In this paper, we introduce PromptNER, a new state-of-the-art algorithm for few-shot and cross-domain NER. To adapt to any new NER task, PromptNER requires a set of entity definitions in addition to the standard few-shot examples. Given a sentence, PromptNER prompts an LLM to produce a list of potential entities along with corresponding explanations justifying their compatibility with the provided entity type definitions. PromptNER achieves state-of-the-art performance on few-shot NER, achieving a 4% (absolute) improvement in F1 score on the CoNLL dataset, a 9% (absolute) improvement on the GENIA dataset, and a 4% (absolute) improvement on the FewNERD dataset. PromptNER also moves the state of the art on cross-domain NER, outperforming prior methods (including those not limited to the few-shot setting), setting a new mark on 3/5 CrossNER target domains, with an average F1 gain of 3%, despite using less than 2% of the available data.
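The idea translates into a simple prompt builder: entity-type definitions plus a few annotated examples, then ask the LLM for candidates with justifications. The wording below is illustrative, not the paper's exact prompt:

```python
# PromptNER-style prompting: definitions + few-shot examples + candidate list
# with per-candidate explanations. Wording is illustrative, not the paper's prompt.
DEFINITIONS = """\
PERSON: the name of a real or fictional human being.
ORG: a company, institution, agency, or other organization.
NOT an entity: pronouns, job titles on their own, generic nouns."""

FEW_SHOT = """\
Text: Tim Cook announced new products at Apple headquarters.
Entities:
1. Tim Cook | True | PERSON | full name of a specific individual
2. Apple | True | ORG | a company
3. headquarters | False | none | generic noun, not a named entity"""

def build_prompt(text: str) -> str:
    return (
        f"Definitions:\n{DEFINITIONS}\n\n"
        f"Example:\n{FEW_SHOT}\n\n"
        f"Text: {text}\nEntities:"
    )

print(build_prompt("Dr. Evans referred the patient to Mercy Hospital."))
```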

https://arxiv.org/pdf/2402.10573v2.pdf

Named Entity Recognition (NER) serves as a fundamental task in natural language understanding, bearing direct implications for web content analysis, search engines, and information retrieval systems. Fine-tuned NER models exhibit satisfactory performance on standard NER benchmarks. However, due to limited fine-tuning data and lack of knowledge, they perform poorly on unseen entity recognition. As a result, the usability and reliability of NER models in web-related applications are compromised. By contrast, Large Language Models (LLMs) like GPT-4 possess extensive external knowledge, but research indicates that they lack specialization for NER tasks. Furthermore, non-public and large-scale weights make tuning LLMs difficult. To address these challenges, we propose LinkNER, a framework that combines small fine-tuned models with LLMs via an uncertainty-based linking strategy called RDC, which enables fine-tuned models to complement black-box LLMs and achieve better performance. We experiment with both standard NER test sets and noisy social media datasets. LinkNER enhances NER task performance, notably surpassing SOTA models in robustness tests. We also quantitatively analyze the influence of key components like uncertainty estimation methods, LLMs, and in-context learning on diverse NER tasks, offering specific web-related recommendations.
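The linking idea in miniature: trust the small model when it is confident, defer to the LLM otherwise. `local_ner`, `ask_llm`, and the threshold are hypothetical stand-ins, not the paper's RDC strategy verbatim:

```python
# Sketch of uncertainty-based linking between a small fine-tuned NER model and
# an LLM. `local_ner` and `ask_llm` are hypothetical stand-ins; the 0.5
# threshold is illustrative only.
from typing import Callable

def link_ner(
    text: str,
    local_ner: Callable[[str], list[dict]],  # -> [{"span": str, "label": str, "confidence": float}]
    ask_llm: Callable[[str, str], str],      # (text, span) -> label decided by the LLM
    threshold: float = 0.5,
) -> list[dict]:
    results = []
    for pred in local_ner(text):
        if pred["confidence"] >= threshold:
            results.append(pred)                    # trust the small model
        else:
            label = ask_llm(text, pred["span"])     # defer to the LLM's world knowledge
            results.append({**pred, "label": label, "source": "llm"})
    return results
```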

https://billtcheng2013.medium.com/named-entity-recognition-with-spacy-and-large-language-model-6716e61913ea

Data masking

https://www.zendata.dev/post/data-masking-what-it-is-and-8-ways-to-implement-it
https://learn.microsoft.com/en-us/sql/relational-databases/security/dynamic-data-masking?view=sql-server-ver16
https://pub.towardsai.net/all-you-need-to-know-about-sensitive-data-handling-using-large-language-models-1a39b6752ced
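In miniature, three masking styles that recur across these guides: redaction, partial masking, and deterministic pseudonymization via hashing:

```python
# Three common masking styles, in miniature: redaction, partial masking,
# and deterministic pseudonymization via salted hashing.
import hashlib

def redact(value: str) -> str:
    return "[REDACTED]"

def partial_mask(value: str, visible: int = 4) -> str:
    """Keep only the last `visible` characters, e.g. of a card or phone number."""
    return "*" * max(len(value) - visible, 0) + value[-visible:]

def pseudonymize(value: str, salt: str = "rotate-me") -> str:
    """Stable token for joins/analytics without exposing the raw value."""
    return hashlib.sha256((salt + value).encode()).hexdigest()[:12]

print(partial_mask("4111111111111111"))    # ************1111
print(pseudonymize("john.smith@mercy.org"))
```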

sbagency

Tech/biz consulting, analytics, research for founders, startups, corps and govs.