The data management process is characterised by a set of tasks in which data quality management (DQM) is one of the core components, and duplicate detection is central to data quality. As online document collections continue to expand, both on the Web and in proprietary environments, the need for duplicate detection becomes more critical: few users wish to retrieve search results consisting of sets of duplicate documents, whether identical duplicates or close matches. Our goal in this work is to investigate the phenomenon and to determine one or more approaches that minimize its impact on search results.

Recent work has focused on using some form of signature to characterize a document in order to reduce the complexity of document comparisons. A representative technique constructs a "fingerprint" from the rarest or richest features in a document, using collection statistics as the criteria for feature selection. One challenge of this approach, however, is that production collections are always changing: new documents, or new versions of documents, arrive frequently, and other documents are periodically removed. When an enterprise freezes a training collection in order to stabilize the underlying repository of features and its associated collection statistics, issues of coverage and completeness arise. We show that even with very large training collections possessing extremely high feature correlations before and after updates, the underlying fingerprints remain sensitive to subtle changes. We explore alternative solutions that benefit from the development of massive meta-collections made up of sizable components from multiple domains; this technique appears to offer a practical foundation for fingerprint stability. We also consider mechanisms for updating training collections while mitigating signature instability.

Our research is divided into three parts. We begin with a study of the distribution of duplicate types in two broad-ranging news collections consisting of approximately 50 million documents. We then examine the utility of document signatures in addressing identical or nearly identical duplicate documents, and their sensitivity to collection updates. Finally, we investigate a flexible method of characterizing and comparing documents that permits the identification of non-identical duplicates. This method has produced promising results in an extensive evaluation using a production-based test collection created by domain experts. The fingerprinting idea is sketched below.
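As a concrete illustration of collection-statistics-based fingerprinting, here is a minimal sketch. It is not the paper's implementation: the choice of the k rarest terms as features, the MD5 hash, and the alphabetical tie-breaking rule are all illustrative assumptions.

```python
from hashlib import md5

def fingerprint(doc_terms, collection_freq, k=6):
    """Fingerprint a document by its k rarest terms.

    doc_terms       -- tokens of the document
    collection_freq -- term -> frequency in the (frozen) training collection
    k               -- fingerprint size; an illustrative choice, not the paper's
    """
    # Rank the document's distinct terms by rarity in the training collection;
    # terms unseen in the collection count as rarest of all.
    rarest = sorted(set(doc_terms),
                    key=lambda t: (collection_freq.get(t, 0), t))[:k]
    # Hash the selected features into a compact, order-independent signature.
    return md5(" ".join(sorted(rarest)).encode("utf-8")).hexdigest()
```

Two documents are treated as duplicates when their fingerprints collide, so each comparison costs a hash lookup rather than a full text comparison. The fragility discussed above is also visible here: if a collection update shifts `collection_freq` even slightly, the set of "rarest" terms can change and the fingerprint changes with it, which is why frozen training collections and multi-domain meta-collections are considered.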
Near-duplicate detection matters beyond search, too. The identification of exact- and near-duplicate texts, and the recognition of unique text within near-duplicate documents, is an important component of data cleaning and integration processes for eRulemaking. Large-volume public comment campaigns, and web portals that encourage the public to customize form letters, produce many near-duplicate documents. This increases processing and storage costs but is rarely a serious problem; the more serious concern is that form-letter customizations can raise substantive issues that agencies are likely to overlook. This paper presents DURIAN (DUplicate Removal In lArge collectioN), a refinement of a prior near-duplicate detection algorithm. DURIAN uses a traditional bag-of-words document representation, document attributes ("metadata"), and document content structure to identify form letters and their edited copies in public comment collections. Experimental results demonstrate that DURIAN is about as effective as human assessors. The paper concludes by discussing the challenges of moving near-duplicate detection into operational rulemaking environments.
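DURIAN's exact algorithm is not reproduced here; the following sketch shows only the bag-of-words half of the idea, flagging comment pairs whose term vectors are nearly identical. The tokenizer, the cosine measure, and the 0.9 threshold are illustrative assumptions, not DURIAN's settings.

```python
import re
from collections import Counter
from math import sqrt

def bag_of_words(text):
    """Lowercased term-frequency vector for a document."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def near_duplicates(comments, threshold=0.9):
    """Return index pairs of comments whose similarity exceeds the threshold."""
    vectors = [bag_of_words(c) for c in comments]
    return [(i, j)
            for i in range(len(vectors))
            for j in range(i + 1, len(vectors))
            if cosine(vectors[i], vectors[j]) >= threshold]

comments = [
    "Please protect the wetlands near my town and keep our water clean.",
    "Please protect the wetlands near my town and keep our water clean. Thank you.",
    "Raise the fuel economy standard.",
]
print(near_duplicates(comments))  # -> [(0, 1)]: the edited form letter pairs with its seed
```

Members of a flagged cluster can then be diffed against the seed form letter to recover the customized text the abstract highlights; combining the similarity score with metadata and document structure, as DURIAN does, reduces false matches between genuinely distinct short comments.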