Empirical study of constructing a knowledge organization system of patent documents using topic modeling

GTM-2012 Global Tech Mining Conference (2; Montreal, Quebec; 2012-09-05)
Principal component analysis
Knowledge organization system
Term clumping
Text clustering
Topic model
Tech Mining, Analysis, and Visualization
Scientometrics (Print). 100(3):787-799
Dordrecht: Springer, 2014.
print; 13; 3/4 p
Conference Paper
University of Chinese Academy of Sciences, No.19A Yuquan Rd., Beijing 100049, China
Chengdu Document and Information Center, Chinese Academy of Sciences, No. 16 South Sec.2 Yihuan Rd., Chengdu 610041, China
Georgia Institute of Technology, Atlanta, GA, United States
Search Technology, Inc., Norcross, GA, United States
Copyright 2015 INIST-CNRS
CC BY 4.0
Unless otherwise stated above, the content of this bibliographic record may be used under a CC BY 4.0 licence by Inist-CNRS
A knowledge organization system (KOS) can reveal the deep knowledge structure of a patent document set. Compared with classification code systems, a personalized KOS made up of topics can represent technology information in a more agile and detailed manner. This paper presents an approach to automatically constructing a KOS of patent documents based on term clumping, the Latent Dirichlet Allocation (LDA) model, K-Means clustering, and Principal Component Analysis (PCA). Term clumping is used to generate a better bag-of-words for topic modeling, and the LDA model is applied to generate raw topics. Then, by iteratively applying K-Means clustering and PCA to the document set and the topic matrix, we generate new upper-level topics and compute the relationships between topics to construct the KOS. Finally, documents are mapped to the KOS. The nodes of the KOS are topics, each represented by terms and their weights, and the leaves are patent documents. We evaluated the approach in an empirical study on a set of Large Aperture Optical Elements (LAOE) patent documents and constructed an LAOE KOS. The method discovered deep semantic relationships between the topics and helped better describe the technology themes of LAOE. Based on the KOS, two types of applications were implemented: automatic classification of patent documents and categorical refinement of search results.
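The KOS structure described in the abstract (topic nodes carrying weighted terms, with patent documents as leaves) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the names `TopicNode`, `merge_topics`, and `assign`, the toy terms and weights, the patent identifier `P1`, the use of cosine similarity for mapping documents, and the hand-picked grouping of raw topics, which stands in for the K-Means and PCA step.

```python
from dataclasses import dataclass, field
from math import sqrt
from typing import Dict, List


@dataclass
class TopicNode:
    """Hypothetical KOS node: a topic described by weighted terms."""
    label: str
    terms: Dict[str, float]
    children: List["TopicNode"] = field(default_factory=list)
    documents: List[str] = field(default_factory=list)  # leaf patent ids


def cosine(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity between two sparse term-weight vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = sqrt(sum(w * w for w in a.values()))
    norm_b = sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def merge_topics(label: str, topics: List[TopicNode]) -> TopicNode:
    """Build an upper topic whose term weights are the mean of its children.

    Assumption: the grouping is supplied by the caller; in the paper it
    comes from iterating K-Means clustering and PCA.
    """
    merged: Dict[str, float] = {}
    for topic in topics:
        for term, weight in topic.terms.items():
            merged[term] = merged.get(term, 0.0) + weight / len(topics)
    return TopicNode(label, merged, children=list(topics))


def assign(doc_id: str, doc_terms: Dict[str, float], root: TopicNode) -> TopicNode:
    """Descend from the root, following the most similar child at each level,
    and attach the document to the topic where the descent ends."""
    node = root
    while node.children:
        node = max(node.children, key=lambda c: cosine(doc_terms, c.terms))
    node.documents.append(doc_id)
    return node


# Toy raw topics (illustrative weights, not results from the study).
polishing = TopicNode("polishing", {"polishing": 0.6, "slurry": 0.4})
grinding = TopicNode("grinding", {"grinding": 0.7, "wheel": 0.3})
coating = TopicNode("coating", {"coating": 0.5, "film": 0.5})

# Upper topics built from a hand-picked grouping of the raw topics.
surface = merge_topics("surface finishing", [polishing, grinding])
thin_films = merge_topics("thin films", [coating])
root = merge_topics("root", [surface, thin_films])

# Map one hypothetical patent, given as term counts, onto the KOS.
target = assign("P1", {"grinding": 2.0, "wheel": 1.0, "film": 1.0}, root)
print(target.label, target.documents)
```

Averaging child weights is only one simple way to describe an upper topic. The same top-down descent also gives a rough picture of the two applications mentioned above: the topic where a document lands acts as its class, and the path of topics leading to it can serve as facets for refining search results.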

