This AI Paper Introduces A Comprehensive RDF Dataset With Over 26 Billion Triples Covering Scholarly Data Across All Scientific Disciplines

On Aug 19, 2023

Keeping up with recent research is becoming increasingly difficult as the number of scientific publications grows. For instance, more than 8 million scientific articles were recorded in 2022 alone. Researchers use various techniques, from search interfaces to recommendation systems, to explore connected scholarly entities, such as authors and institutions. Modeling the underlying academic data as an RDF knowledge graph (KG) is one efficient approach. It makes standardization, visualization, and interlinking with Linked Data resources easier. As a result, scholarly KGs are essential for converting document-centric academic material into linked and automatable knowledge structures.
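To make the idea of an RDF knowledge graph concrete, here is a minimal sketch of how a paper, its author, and an institution could be written as subject-predicate-object triples. The `https://example.org/` namespace, the property names, and the sample records are all hypothetical and are not taken from any real ontology.

```python
# Hypothetical namespace and property names, used only for illustration.
EX = "https://example.org/"


def iri(local_name: str) -> str:
    """Wrap a local name in the example namespace as an N-Triples IRI."""
    return f"<{EX}{local_name}>"


def literal(text: str) -> str:
    """Encode a plain string literal, escaping backslashes and quotes."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


# Each triple is (subject, predicate, object).
triples = [
    (iri("work/W1"), iri("title"), literal("A Study of Knowledge Graphs")),
    (iri("work/W1"), iri("hasAuthor"), iri("author/A1")),
    (iri("author/A1"), iri("name"), literal("Jane Doe")),
    (iri("author/A1"), iri("memberOf"), iri("institution/I1")),
]


def to_ntriples(rows) -> str:
    """Serialize triples in N-Triples style, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in rows)


def objects_of(rows, subject: str, predicate: str):
    """Return every object linked to `subject` via `predicate`."""
    return [o for s, p, o in rows if s == subject and p == predicate]


if __name__ == "__main__":
    print(to_ntriples(triples))
    print(objects_of(triples, iri("work/W1"), iri("hasAuthor")))
```

Because authors and institutions are identified by IRIs rather than by free-text strings, the same identifier can be reused across papers and linked to other datasets.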

However, one or more of the following are limitations of the existing academic KGs:

They seldom include a comprehensive list of works from every subject.
They frequently cover only particular fields, such as computer science.
They are updated infrequently, which renders many analyses and business models outdated.
They often come with usage restrictions.
Even when they meet these criteria, they do not comply with W3C standards such as RDF.

These problems prevent the widespread deployment of scholarly KGs, for example in comprehensive search and recommender systems or for quantifying scientific impact. For instance, the Microsoft Academic Graph was terminated in 2021, so its RDF descendant, the Microsoft Academic Knowledge Graph (MAKG), can no longer be updated.

The innovative OpenAlex dataset seeks to close this gap. OpenAlex's data, however, does not adhere to the Linked Data Principles and is not available in RDF. As a result, OpenAlex cannot be regarded as a KG, which makes semantic queries, application integration, and linking to new resources difficult. At first glance, adding scholarly information about scientific articles to Wikidata, and thereby supporting the WikiCite movement, might seem like a straightforward solution. However, apart from the differences in schema, the amount of data is already so vast that the Wikidata Query Service's Blazegraph triplestore is approaching its capacity limit, which blocks any such integration.

In this work, researchers from Karlsruhe Institute of Technology and Metaphacts GmbH introduce SemOpenAlex, a very large RDF dataset of the academic landscape covering publications, authors, sources, institutions, concepts, and publishers. SemOpenAlex contains about 249 million papers from all academic areas and more than 26 billion semantic triples. It is built on the authors' comprehensive ontology and references additional LOD sources, including Wikidata, Wikipedia, and the MAKG. They offer a public SPARQL interface to facilitate quick and effective use of SemOpenAlex and its integration with the LOD cloud. Additionally, they provide a sophisticated semantic search interface that lets users retrieve information in real time about entities contained in the database and their semantic relationships (for example, by displaying co-authors or an author's most important concepts, which are inferred through semantic reasoning rather than being directly contained in the database).
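As a rough illustration of how a public SPARQL interface is typically used, the sketch below builds a request following the SPARQL 1.1 Protocol and flattens a JSON result document. The endpoint URL is an assumption and should be checked against the project site; the query deliberately avoids any ontology-specific terms and simply lists the outgoing properties of a given entity. The entity URI in the usage example is a placeholder.

```python
import json
import urllib.parse
import urllib.request

# Assumed endpoint address; verify it on the project site before use.
ENDPOINT = "https://semopenalex.org/sparql"


def describe_query(entity_uri: str, limit: int = 10) -> str:
    """Build a vocabulary-agnostic query listing an entity's outgoing triples."""
    return f"SELECT ?p ?o WHERE {{ <{entity_uri}> ?p ?o }} LIMIT {int(limit)}"


def build_request(query: str, endpoint: str = ENDPOINT) -> urllib.request.Request:
    """Prepare a SPARQL 1.1 Protocol GET request asking for JSON results."""
    url = endpoint + "?" + urllib.parse.urlencode({"query": query})
    headers = {"Accept": "application/sparql-results+json"}
    return urllib.request.Request(url, headers=headers)


def parse_bindings(payload: str):
    """Flatten a SPARQL JSON result document into a list of plain dicts."""
    data = json.loads(payload)
    return [
        {var: cell["value"] for var, cell in row.items()}
        for row in data["results"]["bindings"]
    ]


def run(query: str, endpoint: str = ENDPOINT):
    """Send the query over the network and return the parsed rows."""
    with urllib.request.urlopen(build_request(query, endpoint), timeout=30) as resp:
        return parse_bindings(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Placeholder URI; replace it with a real entity URI from the dataset.
    for row in run(describe_query("https://example.org/work/W1", limit=5)):
        print(row["p"], row["o"])
```

Keeping request construction and result parsing separate from the network call makes both parts easy to check offline.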

They also offer full RDF data snapshots to facilitate large-scale data analysis. Because of the scale of SemOpenAlex and the growing number of scientific articles being integrated into it, they have built a pipeline on AWS that regularly updates SemOpenAlex in full without any service disruption. Additionally, they trained state-of-the-art knowledge graph entity embeddings so that SemOpenAlex can be used in downstream applications. By reusing existing ontologies wherever possible, they ensure system interoperability in line with the FAIR principles and open the door to integrating SemOpenAlex into the Linked Open Data Cloud. By offering monthly updates, which enable continuous monitoring of an author's scientific impact, tracking of award-winning research, and other use cases, they fill the void left by the termination of MAKG. Because SemOpenAlex is free and unrestricted, research groups from many disciplinary backgrounds can access its data and incorporate it into their studies. Initial SemOpenAlex use cases and production systems already exist.
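For large-scale analysis, a snapshot is usually processed as a stream rather than loaded into memory. The following sketch counts how often each predicate occurs in a gzip-compressed, line-based N-Triples file. The serialization format and file layout of the actual snapshots are assumptions here, and the function names are hypothetical.

```python
import gzip
import io
from collections import Counter


def count_predicates(stream) -> Counter:
    """Count predicate terms in a line-based N-Triples text stream."""
    counts = Counter()
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        # Subject and predicate contain no whitespace in N-Triples,
        # so the first two whitespace-separated tokens identify them.
        parts = line.split(None, 2)
        if len(parts) == 3:
            counts[parts[1]] += 1
    return counts


def count_predicates_gz(path_or_file) -> Counter:
    """Stream a gzip-compressed N-Triples file without loading it fully."""
    with gzip.open(path_or_file, mode="rt", encoding="utf-8") as stream:
        return count_predicates(stream)


if __name__ == "__main__":
    # Tiny in-memory sample standing in for a real snapshot file.
    sample = (
        '<https://example.org/w1> <https://example.org/title> "A B" .\n'
        "<https://example.org/w1> <https://example.org/hasAuthor> <https://example.org/a1> .\n"
    )
    compressed = io.BytesIO(gzip.compress(sample.encode("utf-8")))
    print(count_predicates_gz(compressed).most_common())
```

A line-by-line pass like this keeps memory use roughly constant regardless of file size, which matters for a dataset with billions of triples.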

Overall, they contribute the following:

1. They develop an ontology for SemOpenAlex that reuses widely used vocabularies.

2. They create the SemOpenAlex knowledge graph in RDF, comprising more than 26 billion triples, and make all SemOpenAlex data, code, and services publicly available at https://semopenalex.org.

3. They enable SemOpenAlex to participate in the Linked Open Data cloud by making all its URIs resolvable. They index all the data in a triple store and make it accessible to the general public through a SPARQL endpoint.

4. They offer a semantic search interface with entity disambiguation so that users may access, search, and instantly view the knowledge graph and its essential statistical data.

5. Using high-performance computing, they offer state-of-the-art knowledge graph embeddings for the entities represented in SemOpenAlex.
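To illustrate how entity embeddings like those in the last contribution are commonly used downstream, the sketch below ranks entities by cosine similarity to a query entity. The three-dimensional vectors and entity identifiers are toy values made up for this example; they are not real SemOpenAlex embeddings, and the article does not specify the embedding format.

```python
import math


def cosine(u, v) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest(query_id: str, embeddings: dict, k: int = 2):
    """Return the k entities most similar to `query_id`, best first."""
    query_vec = embeddings[query_id]
    scored = [
        (other, cosine(query_vec, vec))
        for other, vec in embeddings.items()
        if other != query_id
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy vectors and identifiers, invented for illustration only.
toy_embeddings = {
    "author/A1": [1.0, 0.0, 0.0],
    "author/A2": [0.9, 0.1, 0.0],
    "author/A3": [0.0, 1.0, 0.0],
}

if __name__ == "__main__":
    for entity, score in nearest("author/A1", toy_embeddings):
        print(entity, round(score, 3))
```

The same ranking step could back a simple recommender, for example suggesting similar authors or related works, although real systems would use an approximate nearest-neighbor index at this scale.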

Check out the Paper. All credit for this research goes to the researchers on this project.


Aneesh Tickoo is a consulting intern at MarktechPost. He is currently pursuing his undergraduate degree in Data Science and Artificial Intelligence at the Indian Institute of Technology (IIT), Bhilai. He spends most of his time working on projects aimed at harnessing the power of machine learning. His research interest is image processing, and he is passionate about building solutions around it. He loves to connect with people and collaborate on interesting projects.
