1. INTRODUCTION
Many vertical fields, such as finance, medicine, manufacturing, logistics, electric power, and the military, have accumulated large amounts of structured, semi-structured and unstructured data, including databases, tables, formatted messages, text, images, audio and video. These data are massive, complex and diverse, and they cause serious information overload, leaving users caught in the dilemma of “information loss” and “knowledge shortage”. How to extract high-value knowledge quickly, accurately and comprehensively from industry data that come from many sources, are large in scale, scattered in distribution and diverse in modality is a major problem to be solved in the intelligent construction of these fields. Such knowledge helps users obtain the data they need for different tasks, and supports real-time situational awareness and assisted command decision-making.

2. KNOWLEDGE EXTRACTION
Knowledge extraction technology is an important means of solving the information overload problem for huge amounts of heterogeneous data. As shown in Figure 1, it extracts structured knowledge, such as concepts, entities, attributes, relations and events, from structured, semi-structured and unstructured data, and provides strong support for applications such as knowledge graph construction, semantic search, intelligent question answering and intelligent decision-making. Research on knowledge extraction for vertical fields helps improve the usability and reusability of industry information and enables data-driven, knowledge-centered intelligent decision-making. Knowledge extraction methods fall mainly into three groups: those based on rules and patterns, those based on machine learning models, and those based on deep learning models [1-11].
Rule- and pattern-based methods rely on patterns and rules defined manually by analyzing the positional structure, word composition, occurrence frequency and other characteristics of the target knowledge; they aim at precise extraction rather than broad recall. Machine learning-based methods convert a knowledge extraction task into a multi-class classification or sequence labelling task, and train models (such as Conditional Random Fields, Hidden Markov Models and Support Vector Machines) on a feature set to perform it. Deep learning methods, thanks to their ability to capture features automatically, are widely used in knowledge extraction; they train neural network models (such as Convolutional Neural Networks, Recurrent Neural Networks and Long Short-Term Memory networks) on a large amount of labelled data. Table 1 shows the advantages and disadvantages of these three mainstream knowledge extraction methods.

Table 1. Advantages and disadvantages of mainstream methods of knowledge extraction.
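
The contrast between the rule-based and the sequence-labelling formulations can be illustrated with a minimal Python sketch. It is illustrative only and is not the method of any cited work: `ORG_PATTERN`, the `ORG` label, the function names and the sample sentence are all assumptions made for this example. A hand-written pattern extracts one span precisely but misses a lower-cased mention, and the extracted span is then re-encoded as per-token BIO tags, the form in which a sequence labeller such as a CRF would be trained.

```python
import re

# Hypothetical hand-written rule: one or more capitalised words followed by
# an organisation suffix. Real rule sets are larger and domain specific.
ORG_PATTERN = re.compile(r"\b(?:[A-Z][a-z]+ )+(?:Bank|Corp|Ltd)\b")


def extract_by_rule(text):
    """Return (start, end, label) character spans matched by the rule."""
    return [(m.start(), m.end(), "ORG") for m in ORG_PATTERN.finditer(text)]


def to_bio_tags(text, spans):
    """Recast span extraction as sequence labelling.

    Each whitespace-separated token receives B-<label> if it opens a span,
    I-<label> if it lies inside one, and O otherwise.
    """
    tagged = []
    for token in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            if token.start() >= start and token.end() <= end:
                prefix = "B-" if token.start() == start else "I-"
                tag = prefix + label
                break
        tagged.append((token.group(), tag))
    return tagged


text = "Shares of Northern Trust Bank rose while acme bank fell"
spans = extract_by_rule(text)      # the rule finds the capitalised mention only
tagged = to_bio_tags(text, spans)  # the same knowledge as per-token labels

for token, tag in tagged:
    print(f"{token}\t{tag}")
```

The lower-cased "acme bank" is not matched, which shows why rule-based extraction favours precision over recall; a trained sequence labeller would instead learn such cases from labelled token sequences like the one produced by `to_bio_tags`.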

3. CRITICAL PROBLEMS AND CHALLENGES OF KNOWLEDGE EXTRACTION IN VERTICAL FIELD
Structured and semi-structured data in vertical fields usually have a regular structure and a fixed format, so extracting knowledge from them is relatively easy. However, the amount of knowledge obtained this way is limited and can hardly meet the needs of large-scale knowledge applications. It is therefore necessary to extract knowledge from unstructured data as well in order to expand the scale. Although unstructured data contain rich knowledge, they suffer from diverse structures, insufficient samples, unbalanced categories, terse context and other problems, which pose new challenges to knowledge extraction technology.
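
To make the unbalanced-category problem concrete, the following minimal Python sketch computes inverse-frequency class weights, one common heuristic for keeping rare categories from being swamped during training. It is an assumption-laden illustration, not a technique proposed in this paper: the label names, the label counts and the function `balanced_class_weights` are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so their errors count for more
    in a weighted training loss.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution: most tokens carry no knowledge ("O"),
# while entity and event labels are scarce.
labels = ["O"] * 90 + ["ORG"] * 8 + ["EVENT"] * 2
weights = balanced_class_weights(labels)

for label, weight in sorted(weights.items(), key=lambda item: item[1]):
    print(f"{label}\t{weight:.3f}")
```

With this weighting every class contributes the same total weight, so the two `EVENT` examples together count as much as the ninety `O` examples.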

4. TECHNICAL ARCHITECTURE FOR KNOWLEDGE EXTRACTION IN VERTICAL FIELD
To address the above problems and challenges, the technical architecture for knowledge extraction in vertical fields is designed as shown in Figure 2. The design takes into account the structure, category, scale and quality of industry data, among other factors, as well as the strengths of current mainstream techniques. The solution is as follows.
Through the above stages, multiple kinds of knowledge, such as concepts, entities, attributes, events and the relations among them, can be extracted from large-scale multimodal data. These strategies address the problems of industry knowledge extraction in an environment of massive multi-source heterogeneous data. They also support applications such as large-scale knowledge graph construction, natural language understanding, intelligent question answering and intelligent decision-making.

5. CONCLUSION
Knowledge extraction technology can automatically mine valuable knowledge from massive, scattered and heterogeneous industry data. It plays an important role in natural language understanding, information mining, situation awareness and intelligent command decision-making, and accelerates the arrival of the era of knowledge-based intelligent construction. This paper analyzes the key problems and challenges of knowledge extraction in vertical fields and designs a technical framework for knowledge extraction targeted at vertical fields. The framework presents a detailed solution to the key problems in extracting industry knowledge. Because problems such as insufficient samples and imbalanced categories are common in vertical fields, knowledge extraction methods based on small samples and transfer learning are likely to become active research topics.

REFERENCES
[1] Astrakhantsev, N., “Automatic term acquisition from domain-specific text collection by using Wikipedia,” in Proceedings of the Institute for System Programming of RAS, 7–20 (2014).
[2] Sari, Y., Hassan, M. F. and Zamin, N., “Rule-based pattern extractor and named entity recognition: A hybrid approach,” 2010 International Symp. on Information Technology, 563–568 (2010). https://doi.org/10.1109/ITSIM.2010.5561392
[3] Liu, X. and Yu, N., “Multi-type web relation extraction based on bootstrapping,” in 2010 WASE Inter. Conf. on Information Engineering, 24–27 (2010).
[4] Kambhatla, N., “Combining lexical, syntactic, and semantic features with maximum entropy models for information extraction,” Annual Meeting of Association of Computational Linguistics, 178–181 (2004).
[5] Saha, S. K., Sarkar, S. and Mitra, P., “Feature selection techniques for maximum entropy based biomedical named entity recognition,” Journal of Biomedical Informatics, 42(5), 905–911 (2009). https://doi.org/10.1016/j.jbi.2008.12.012
[6] Culotta, A. and Sorensen, J., “Dependency tree kernels for relation extraction,” in Proc. of the 42nd Annual Meeting of the Association for Computational Linguistics (ACL-04), 423–429 (2004).
[7] Poulymenopoulou, M., Malamateniou, F. and Vassilacopoulos, G., “Machine learning for knowledge extraction from PHR big data,” Stud. Health. Technol. Inform, 202(1), 36–39 (2014).
[8] Zhao, H. and Wang, F., “A deep learning model and self-training algorithm for theoretical terms extraction,” Journal of the China Society for Scientific and Technical Information, 37(9), 923–938 (2018).
[9] Luo, L., Yang, Z., Yang, P., et al., “An attention-based BiLSTM-CRF approach to document-level chemical named entity recognition,” Bioinformatics, 34(8), 1381–1388 (2018). https://doi.org/10.1093/bioinformatics/btx761
[10] Peng, N., Poon, H., Quirk, C., et al., “Cross-sentence N-ary relation extraction with graph LSTMs,” Transactions of the Association for Computational Linguistics, 5(1), 101–115 (2017). https://doi.org/10.1162/tacl_a_00049
[11] Lin, Y., Ji, H., Huang, F., et al., “A joint neural model for information extraction with global features,” in Proc. of the 58th Annual Meeting of the Association for Computational Linguistics, 7999–8009 (2020).