University of Missouri Libraries

Data Sets for Quantitative Research: Public Use Datasets


Finding Datasets on the Internet

Many research organizations make data available on the web, but there is still no perfect mechanism for searching the content of all these collections. The links below lead to data search portals that appear to be among the best available. Note that these portals point to both free and fee-based sources, and to both raw data and processed statistics.

  • PEW Research Center
  • Open Access Directory (OAD) Data Repositories
  • UK Data Archive
  • Socioeconomic Applications Data Center
  • Council of European Social Science Data Archives (CESSDA)
  • NTIS Federal Computer Products Center * Includes databases, data files, CD-ROMs, etc. available for purchase.
  • Harvard Dataverse
  • re3data.org Registry of Research Data Repositories
  • Open Data: European Commission Launches European Data Portal (over 1 million datasets from 36 countries)
  • Awesome Public Datasets (on GitHub) * Includes a mix of free and pay resources.
  • SNAP (Stanford Network Analysis Project)
  • Statistics, Resources and Big Data on the Internet, 2020 *

 * Resources that are not entirely free are marked with an asterisk.

Transform web information into machine-readable data for analysis

Have you found fantastic numeric information in a less-than-ideal format, such as PDF or HTML? Here are some software products that may help you transform those formats into numbers you can read into a spreadsheet or statistical software program. Some of these are free or offer limited-time free trials:

  • Spark OCR : Find tables in images, visually identify rows and columns, and extract data from cells into data frames. Turn scans from financial disclosures, academic papers, lab results and more into usable data. 
  • PDFTables : PDF to Excel Converter
  • Tabula : Extract tables from PDFs
  • table-ocr : For those who know Python
  • Abbyy Finereader : Access and modify information locked in paper-based documents and PDF files
  • OCR Space : This free service transforms PDFs into plain text files directly in your browser. Rows and columns are preserved, making it easier to import the file into Excel using the Import Text Wizard. See further explanation and instructions here: Table recognition with OCR.
  • Parsehub : Data mining tool for data scientists and journalists
  • Webhose : Turn unstructured web content into machine-readable data feeds
  • Data Streamer : Index weblogs, mainstream news, and social media
  • Outwit : Turn websites into structured data
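As a minimal sketch (not one of the tools above): once a converter such as OCR Space has produced plain text with aligned columns, a few lines of Python's standard library can finish the job and write a CSV that Excel or a stats package can open. The sample table below is made up for illustration.

```python
import csv
import io

# Hypothetical plain-text table, e.g. the output of an OCR tool;
# columns are separated by runs of whitespace.
plain_text = """\
state      population
Missouri   6168187
Kansas     2940865
"""

# Split each non-empty line on whitespace to recover rows and columns.
rows = [line.split() for line in plain_text.splitlines() if line.strip()]

# Write the rows out as CSV, ready for a spreadsheet or stats program.
buf = io.StringIO()
csv.writer(buf).writerows(rows)
print(buf.getvalue())
```

This whitespace-splitting approach assumes no cell contains internal spaces; tables with multi-word cells need one of the column-aware tools listed above.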

Feeling intrigued, but unsure how to leverage web-based data for your own research?  Here are some how-to guides:

  • Data Journalism: What it is and why should I care?
  • How to get data from the Web
  • Manipulating data
  • Data for journalists: a practical guide for computer-assisted reporting by Brant Houston (2019)
  • Scraping for journalists by Paul Bradshaw (2013)
  • Data Journalism Heist by Paul Bradshaw (2013)

Selected datasets on the Internet, arranged by topic

These are some of the most significant datasets available on the internet, arranged by topic.  Almost everything here is freely available. The few that do involve fees are marked with asterisks (*). Note that some of the listings below are also available in ICPSR.

Political Science/Public Policy

  • American National Election Studies
  • Conflict and Peace Data Bank, 1948-1978  available through ICPSR
  • Correlates of War
  • Cross-National Time-Series Data Archive  (available as a library item on CD-ROM)
  • International Country Risk Guide (ICRG) Table 3B: Political Risk Points by Component, 1984-2009 (available through MU Library to current affiliates)
  • Polidata Presidential Results By Congressional District 1992-2004  (available through MU Library to current affiliates)
  • Record of American Democracy, 1984-1990
  • Survey of Income and Program Participation 

Demographics

  • IPUMS: Integrated Public Use Microdata Series
  • Geocorr  -- Geographic Correspondence Engine
  • Missouri Census Data Center UEXPLORE/Dexter  ( explanation )
  • National Historical Geographic Information System

Business and Economics

  • Consumer Expenditure Surveys microdata
  • National Bureau of Economic Research data
  • National Longitudinal Surveys from the Bureau of Labor Statistics
  • Panel Study of Income Dynamics
  • World Bank – Poverty and Equity Data
  • International industrial development (manufacturing, mining, utilities, etc.) data from the United Nations (UNIDO)

Health

  • Biologic Specimen and Data Repository Information Coordinating Center (BioLINCC)
  • Demographic and Health Surveys (mainly developing countries)
  • Global Health Observatory data repository from the World Health Organization
  • ICPSR Health and Medical Care Archive
  • ICPSR National Addiction and HIV Data Archive Program
  • National Center for Health Statistics Public Use Data Files from the U.S. Centers for Disease Control
  • Missouri Information for Community Assessment (MICA) health datasets
  • National Longitudinal Study of Adolescent Health
  • National Cancer Institute SEER data

Environment

  • DataONE Earth and environmental data
  • EPA Environmental dataset gateway

Sociology

  • General Social Survey
  • National Longitudinal Surveys (U.S. Bureau of Labor Statistics)
  • National Survey of Households and Families
  • Pew Internet & Technology
  • World Values Survey

Education

  • National Center for Education Statistics DataLab
  • The National Survey of College Graduates (NSCG)
  • NCES Public Elementary & Secondary Schools Universe Survey Data
  • The Survey of Doctorate Recipients (SDR)

Miscellaneous

  • American Religion Data Archive
  • National Household Travel Survey
  • Roper Opinion Polls * 

*Resources that are not entirely free are marked with an asterisk

  • Last Updated: Nov 3, 2023 1:05 PM
  • URL: https://libraryguides.missouri.edu/datasets


National University Library

Research Process


SAGE Research Methods: Datasets

Statistics

Content: Practical guides to data analysis, comprising peer-reviewed datasets and tools to manage data.

Purpose: Use to learn and practice data analysis including cleaning and normalizing data. 

A dataset (also spelled ‘data set’) is a collection of raw statistics and information generated by a research study. Datasets produced by government agencies or non-profit organizations can usually be downloaded free of charge. However, datasets developed by for-profit companies may be available for a fee.

Most datasets can be located by identifying the agency or organization that focuses on a specific research area of interest. For example, if you are interested in learning about public opinion on social issues, Pew Research Center would be a good place to look. For data about population, the U.S. government’s Population Estimates Program from American Factfinder  would be a good source.

An “open data” philosophy is becoming more common among governments and business organizations around the world, with the belief that data should be freely accessible. Open data efforts have been led by both government and non-government organizations such as the Open Knowledge Foundation. Learn more by exploring The Open Data Handbook. There is also a growing trend toward what is called “Big Data”, where extremely large amounts of data are analyzed for new and interesting perspectives, and data visualization, which is helping to drive the availability and accessibility of datasets and statistics.

Don't know where to begin? Here is a quick view of our recommendations.

* Indicates that datasets on this topic are prominent in the source

For additional information about locating statistics , please see our Statistics page.

  • The Evolution of Big Data, and Where We’re Headed
  • Data Visualization and Infographics

Subject Specific and Additional Dataset Resources

  • Computer Science
  • Public Opinion/Surveys
  • Social Sciences
  • Social Media or Community Driven Datasets
  • Additional Dataset Resources
  • Large Datasets
  • Searchable Sites
  • Datasets for Learning Purposes
  • Tools for Data Analysis
  • Damodaran Online: Corporate Finance and Valuation NYU, Stern School of Business, Dr. Aswath Damodaran
  • IMF DataMapper
  • IMF Fiscal Rules Dataset (1985-2013)
  • International Monetary Fund Data & Statistics
  • National Longitudinal Surveys Bureau of Labor Statistics
  • Organization for Economic Co-Operation and Development Data
  • Quandl “Time-series” numerical only data for economics, finance, markets & energy; Features step-by-step wizard for finding and compiling data.
  • Statistical Abstract of the United States (2012): Banking, Finance, & Insurance
  • Statistical Abstract of the United States (2012): Business Enterprise
  • Surveys of Consumers Thomson Reuters & University of Michigan
  • U. S. Bureau of Economic Data
  • Mergent Online Financial records, country and industry reports. Searchable by company name, country, number of employees and more. Up to 15 years of historical data. Also provides news articles on recent mergers and acquisitions, as well as industry and country reports.
  • ACM A research, discovery and network platform. The database provides journals, conference proceedings, technical magazines, newsletters and books. Provides a list of authors after an initial topic search, includes a dataset search filter, and the ability to sort results by most cited.
  • IEEE Full-text peer-reviewed journals, transactions, magazines, conference proceedings, and published standards in the areas of electrical engineering, computer science, and electronics. Access to the IEEE Standards Dictionary Online. Useful to learn about current technology industry trends. Along with ACM database, this database has a function that allows searching for datasets.
  • Barro-Lee Dataset Datasets available for download from their article: Barro, R., & Lee, J. (n.d). A new dataset of educational attainment in the world, 1950-2010. Journal Of Development Economics,104,184-198.
  • Child care and Early Education Research Connections
  • Datasets from NCES
  • Education Data.gov
  • Higher Education General Information Survey (HEGIS) Series
  • Integrated Postsecondary Education Data System (IPEDS)
  • National Center for Education Statistics (NCES)
  • Statistical Abstract of the United States (2012): Education
  • U.K. Department of Education Datasets
  • American Psychological Association Links to datasets and Repositories
  • Children Born to Unwed Parents between 1998-2000 Princeton
  • Childstats.gov Forum on Child and Family Statistics
  • Gender & Achievement Research Program
  • The Kinsey Institute Data Archives
  • National Archive of Criminal Justice Data
  • National Data Archive on Child Abuse and Neglect
  • National Longitudinal Study of Adolescent Health Add Health
  • Neuroscience Information Framework (NIF) Data Federation
  • Substance Abuse and Mental Health Data Archive (SAMHDA)
  • Gallup.com Global datasets on what people from all over the world think about important social issues, as well as financial behavior and literacy.
  • General Social Survey (GSS) A social trends survey conducted on American society and compared to international trends. The survey has been conducted since 1972. Datasets are available in SPSS and Stata formats, with additional options available.
  • International Social Survey Programme (ISSP) Affiliated with the GSS, this survey has been conducted since 1980.
  • The Latin American Databank Provides a portal for Latin American datasets acquired, processed and archived by the Roper Center for Public Opinion Research. Data can be browsed by country or decade. Keyword search options are also available.
  • Pew Research Center Datasets available to download for many of the Center’s main projects. Free registration is required to download.
  • Roper Center Public Opinion Archives Over 20,000 datasets available from 1935 to present. Users can also set up an RSS feed for updates.
  • World Values Survey Datasets available to download for surveys dating back to 1981 in SPSS, SAS and STATA formats.
  • Consortium of European Social Science Data Archives (CESSDA)
  • Gapminder A non-profit organization that calls itself a “fact tank”. More than 500 world demographic indicators from the World Bank, the Lancet, and many other entities are available to download in Excel format or to view and visualize online.
  • Inter-university Consortium for Political and Social Research (ICPSR) One of the largest collections of data for social and behavioral research. File formats include SPSS, SAS and csv.
  • National Archive on Criminal Justice Data
  • National Center for Health Statistics (NCHS) Extensive tutorials are available to assist users with learning how to incorporate NCHS data into their research.
  • The Odum Institute Dataverse University of North Carolina Chapel Hill
  • U. S. Department of Housing and Urban Development (HUD) Housing and housing market data provided by government.
  • U.K. Data Service Sponsored by the U.K. Economic & Social Research Council (ESRC).
  • Association of Religion Archives
  • U.S. Bureau of Labor Statistics Economy and labor market provided by governmental site.
  • U.S. Census Bureau Population demographics provided by governmental site.
  • Guardian (UK) Datablog
  • Kaggle 3rd party, multi-disciplinary crowd-sourcing platform. Check the credibility of data provided by contributors not affiliated with academic or professional institutions.
  • Social Computing Data Repository Arizona State University collects and makes available for download datasets from the most popular social networks including Twitter, FourSquare, YouTube and more.
  • Stanford Large Network Dataset Collection Features data from social networks, online reviews and more.
  • Registry of Open Data on AWS Notable sets include the NASA Nex Project and 1000 Genome Project.
  • Figshare 3rd party multi-disciplinary repository. Search by keyword or browse by subject.
  • Africa Open Data Search and download more than 900 datasets from countries across the continent. File formats are available in csv, zip and shapefile (shp) for use with GIS software.
  • American Fact Finder A division of the US Census Bureau, this site provides datasets from censuses and surveys conducted by the Bureau.
  • Data.gov The gateway to searching and discovering U.S. government data. This site boasts over 90,000 datasets!
  • Data.gov.uk Search over 17,000 datasets from the government of the United Kingdom. This database allows for limiting search results by theme (subject), format (file type) and publisher.
  • European Union Open Data Portal Gateway to data produced by EU member institutions. The homepage features most viewed datasets, as well as updated datasets and top publishers (agencies/institutions). Most datasets can be downloaded in pdf or zip formats.
  • National Digital Archive of Datasets A division of the U.K. National Archives, these datasets are from 1997-2010. Fully searchable and can be downloaded in html, csv, xls, and more.
  • Open Data Canada Search and download datasets in different formats (csv, xml, zip, html). Featured datasets are also available across a wide range of categories.
  • United Nations Data The gateway to data and statistics for UN supported projects, including the Monthly Bulletin of Statistics. To learn how to best use this resource, see these FAQs.
  • UN Statistical Databases Directory of UN statistical databases from the United Nations Dag Hammarskjöld Library.
  • World Bank Datasets can be browsed and searched across a wide range of indicators and categories. Download options are available from basic to advanced. View the World Bank Databank tutorial to learn more about how to use and download datasets.
  • Datacatalogs.org This site provides a browsable, searchable directory of open data catalogs around the world, including government and non-government sources.
  • Datacite A repository of open datasets that are available online. Links to the dataset homepage are available along with the associated subjects, publisher (authority) and description.
  • Dryad A curated resource that makes research data discoverable, freely reusable, and citable. Provides a general-purpose home for a wide diversity of data types.
  • Google Public Data Freely available tool for searching public datasets. Importing, saving and linking tools are also available. See more from Google Public Data Help.
  • Harvard Dataverse Network An open network of research and scientific data containing over 50,000 studies.
  • Qualitative Data Repository Dedicated archive for storing and sharing digital data (and accompanying documentation) generated or collected through qualitative and multi-method research in the social sciences and related disciplines.
  • Figshare Figshare is a repository where users can make all of their research outputs available in a citable, shareable and discoverable manner
  • Re3 Data Re3data is a global registry of research data repositories covering different academic disciplines. It includes repositories that provide permanent storage of and access to datasets for researchers, funding bodies, publishers, and scholarly institutions. re3data promotes a culture of sharing, increased access and better visibility of research data. The registry went live in autumn 2012 and has been funded by the German Research Foundation (DFG).
  • Kaggle This for-profit company offers data forecasting services for the energy industry, also maintains a platform for “predictive modeling competitions”. Get a team together and challenge yourselves to compete!
  • Sociology Data Set Server St. Joseph’s University, Dept. of Sociology
  • SPSS Data Page East Carolina University, Dept. of Psychology, Dr. Karl L. Wuensch
  • SPSS Data Sets Butler University, Dept. of Psychology, Dr. Roger J. Padgett
  • Statistical Reference Datasets National Institute for Standards & Technology
  • Statistics for Psychology University of Bath, Dept. of Psychology, Dr. Ian Walker
  • Teaching with Data While this site does not have datasets to download, they have excellent resources for locating datasets and other tools for using data in education.
  • UCI Machine Learning Repository Used primarily for the computer sciences, a number of social sciences datasets are available here. Each dataset has cited references.
  • V7 Open Datasets Open-access searchable page with over 500 quality datasets.
  • National Map This website provides datasets for representing U. S. government data using various map tools. Maps include: The National Atlas of the United States, U.S. Topo, Historical Topographic Map Collection, and the National Map Viewer.
  • Nesstar (Norwegian Social Science Data Services) An open access, web-based tool for publishing and analyzing data.
  • OpenRefine Formerly Google Refine, this free tool allows intermediate to advanced level users multiple options for managing large datasets.
  • Social Explorer This tool allows users to manipulate data from demographic and economic sources to create their own maps, interactive images, and more. The limited free version provides access to data from the 2000 US Census.
  • Statwing A limited free tool to analyze and visualize data. (Note:The free version makes your data available publicly up to 25mb.)
  • TableauPublic A free tool for visualizing data in a wide variety of design options.

Health Dataset Sites

  • Hospitals and Spending
  • Medicaid & Medicare
  • Multi-topic
  • Non-Profit Hospitals
  • CDC BRFSS Health-related telephone surveys that collect state data about U.S. residents regarding their health-related risk behaviors, chronic health conditions, and use of preventive services.
  • CDC Data Statistics on major diseases.
  • CDC Wonder Data on diseases and death, as well as prevention.

Sources for statistics on hospitals and/or hospital spending.

  • AHA Annual survey of hospitals in the United States. It includes counts such as the number of government hospitals and the number of hospital beds.
  • AHD The American Hospital Directory® provides data, statistics, and analytics about more than 7,000 hospitals nationwide. AHD.com® hospital information includes both public and private sources such as Medicare claims data, hospital cost reports, and commercial licensors. AHD® is not affiliated with the American Hospital Association (AHA) and is not a source for AHA Data. Our data are evidence-based and derived from the most definitive sources.
  • AHRQ HCUPnet A free online query system based on data from the Healthcare Cost and Utilization Project (HCUP). The system provides health care statistics and information for hospital inpatient, emergency department, and ambulatory settings, as well as population-based health care data on counties.
  • CMS.gov Research, data and statistics on Medicare & Medicaid from The Centers for Medicare & Medicaid Services, CMS, part of the Department of Health and Human Services (HHS).
  • Medicare.gov Compare hospitals quality of care.

A list of public datasets by topic, from the Society of General Internal Medicine.

  • ProPublica Various healthcare datasets including Medicare, treatments and nursing homes.
  • HealthData.gov 50 datasets, mostly related to COVID-19.
  • IRS Non-Profit Hospital Compliance Report The IRS commenced its Hospital Compliance Project in May 2006 to study nonprofit hospitals and community benefit, and to determine how nonprofit hospitals establish and report executive compensation. The Project involved mailing a comprehensive compliance check questionnaire to 544 nonprofit hospitals and analyzing their responses.
  • CDC NCHS Access data from national health surveys.

Digging for Data Webinar

Not sure where or how to start your data search? This webinar provides a basic overview of how to find datasets using Google Dataset Search and other dataset directories and repositories, and answers any questions you bring to the session.

Searching for Datasets Online

  • Google Dataset Search

Google Dataset Search is a search engine across metadata for millions of datasets in thousands of repositories across the Web. Similar to how Google Scholar works, Google Dataset Search lets you find datasets wherever they’re hosted, whether it’s a publisher's site, a digital library, or an author's personal web page.

Dataset Search can be useful to a broad audience, whether you're looking for scientific data, government data, or data provided by news organizations. Simply enter what you are looking for, and the results will guide you to the published dataset on the repository provider’s site.

Screenshot of search results for Google Dataset Search

Persistent links to datasets may be found by clicking on the share icon. You can then copy and paste the link to share or save the location.

Screenshot showing the share feature in Google Dataset Search

  • To find open data for a particular U.S. state or country, try using a search engine and the keywords: open data [name of state or country], as shown in the image below.

Screenshot of Google search results for search terms arizona open data.

  • You can also search Google for datasets by typing in your topic followed by the keywords "raw data" or "datasets". For example, "barriers to AI adoption raw data or datasets".
  • Lastly, you can search Google with the filetype:xls operator, which will pull Excel documents that might contain raw data. For example, "artificial intelligence filetype:xls".
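Once a dataset found this way has been downloaded, a quick first look takes only a few lines of Python. The file name, columns, and values below are hypothetical stand-ins for whatever you retrieved.

```python
import csv

# Write a tiny stand-in for a downloaded open-data file
# (file name and contents are hypothetical).
with open("open_data_sample.csv", "w", newline="") as f:
    f.write("county,median_income\nMaricopa,72944\nPima,59215\n")

# Inspect the column headers and row count before deeper analysis.
with open("open_data_sample.csv", newline="") as f:
    reader = csv.DictReader(f)
    records = list(reader)

print(reader.fieldnames)  # ['county', 'median_income']
print(len(records))       # 2
```

Checking headers and row counts first helps catch truncated downloads or files whose real delimiter is a tab or semicolon rather than a comma.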

Locating an Original Dataset from a Journal Article

  • ACM Digital Library
  • IEEE Xplore Digital Library


Content: The Association for Computing Machinery database is a research, discovery and network platform. The database provides journals, conference proceedings, technical magazines, newsletters and books.

Purpose: An essential database for computing and technology research topics.

Special Features: Provides a list of authors after an initial topic search, includes a dataset search filter, and the ability to sort results by most cited.

Use the following steps to locate the actual dataset used in a research article within the ACM Digital Library database. 

  • Access the  ACM Digital Library  database from the  A-Z Databases List . 
  • Using the search box, enter your keyword terms to locate relevant research articles on your topic. 

ACM Digital Library search results refined by content formats

Content: Full-text peer-reviewed journals, transactions, magazines, conference proceedings, and published standards in the areas of electrical engineering, computer science, and electronics.

Purpose: Use to learn about current technology industry trends.

Special Features: Users may search for datasets.

To limit to full-text only, change the results from "All Results" to "My Subscribed Content".

Use the following steps to locate the actual dataset used in a research article within the IEEE Xplore Digital Library database. 

  • Access the IEEE Xplore Digital Library database from the A-Z Databases List.

IEEE Xplore Digital Library search box


  • Last Updated: Apr 14, 2024 12:14 PM
  • URL: https://resources.nu.edu/researchprocess

National University

© Copyright 2024 National University. All Rights Reserved.


Datasets: arxiv_dataset

The dataset viewer is disabled because the authors forbid processing this dataset automatically and require users to download the dataset files manually.

Dataset Card for arXiv Dataset

Dataset Summary

A dataset of 1.7 million arXiv articles for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

Supported Tasks and Leaderboards

[More Information Needed]

Languages

The language supported is English.

Dataset Structure

Data Instances

This dataset is a mirror of the original arXiv data. Because the full dataset is rather large (1.1 TB and growing), this dataset provides only a metadata file in JSON format. An example is given below.

Data Fields

  • id : ArXiv ID (can be used to access the paper)
  • submitter : Who submitted the paper
  • authors : Authors of the paper
  • title : Title of the paper
  • comments : Additional info, such as number of pages and figures
  • journal-ref : Information about the journal the paper was published in
  • doi : Digital Object Identifier
  • report-no : Report Number
  • abstract : The abstract of the paper
  • categories : Categories / tags in the ArXiv system
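The fields above can be read with a few lines of Python. The metadata file is distributed as JSON Lines (one JSON object per line); the record and file path below are fabricated for illustration.

```python
import json

# One fabricated metadata record using the fields listed above.
line = ('{"id": "0704.0001", "submitter": "Jane Doe", '
        '"authors": "J. Doe and R. Roe", "title": "An Example Paper", '
        '"comments": "12 pages, 3 figures", "journal-ref": null, '
        '"doi": null, "report-no": null, "categories": "hep-ph", '
        '"abstract": "A fabricated abstract for illustration."}')

record = json.loads(line)
print(record["title"])       # An Example Paper
print(record["categories"])  # hep-ph

# Reading the real metadata file would look like (path is hypothetical):
# with open("arxiv-metadata-oai-snapshot.json") as f:
#     for raw in f:
#         record = json.loads(raw)
```

Fields such as journal-ref and doi are null for many papers, so code consuming the file should handle missing values.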

Data Splits

The data was not split.

Dataset Creation

Curation Rationale

For nearly 30 years, arXiv has served the public and research communities by providing open access to scholarly articles, from the vast branches of physics to the many subdisciplines of computer science, as well as math, statistics, electrical engineering, quantitative biology, and economics. This rich corpus offers significant, but sometimes overwhelming, depth. In these times of unique global challenges, efficient extraction of insights from data is essential. To help make arXiv more accessible, a free, open pipeline to the machine-readable arXiv dataset is provided on Kaggle: a repository of 1.7 million articles with relevant features such as article titles, authors, categories, abstracts, full-text PDFs, and more. The goal is to empower new use cases and richer machine learning techniques that combine multi-modal features for applications like trend analysis, paper recommender engines, category prediction, co-citation networks, knowledge graph construction and semantic search interfaces.

Source Data

This data is based on arXiv papers. [More Information Needed]

Initial Data Collection and Normalization

Who are the source language producers?

[More Information Needed]

Annotations

This dataset contains no annotations.

Dataset Curators

The original data is maintained by arXiv.

Licensing Information

The data is under the Creative Commons CC0 1.0 Universal Public Domain Dedication.

Citation Information

Contributions

Thanks to @tanmoyio for adding this dataset.



Dataset Loaders

  • tensorflow/datasets
  • armancohan/long-summarization


The benchmarks section lists all benchmarks using a given dataset or any of its variants. We use variants to distinguish between results evaluated on slightly different versions of the same dataset. For example, ImageNet 32⨉32 and ImageNet 64⨉64 are variants of the ImageNet dataset.


Arxiv summarization dataset

This is a dataset for evaluating summarisation methods for research papers.


Similar Datasets

Arxiv hep-th citation graph

Public Use Data Archive

The NBER Public Use Data Archive is an eclectic mix of public-use economic, demographic, and enterprise data obtained over the years to satisfy the specific requests of NBER-affiliated researchers for particular projects. Files here are often in more convenient formats than the original data source. However, files that receive updates at the source may not be updated here. The Public Use Data archive also serves as a repository of the outputs, be they data or code, of NBER projects that, when allowed by the sources, are intended for wider use or replication efforts.


unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata

  • Open access
  • Published: 02 March 2020
  • Volume 125, pages 3085–3108 (2020)


  • Tarek Saier, ORCID: orcid.org/0000-0001-5028-0109
  • Michael Färber, ORCID: orcid.org/0000-0001-5458-8645


In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the data set used. However, existing scholarly data sets are limited in several regards. In this paper, we propose a new data set based on all publications from all scientific disciplines available on arXiv.org. In addition to the papers’ plain text, in-text citations are annotated via global identifiers. Furthermore, citing and cited publications are linked to the Microsoft Academic Graph, providing access to rich metadata. Our data set consists of over one million documents and 29.2 million citation contexts. The data set, which is made freely available for research purposes, not only can enhance the future evaluation of research paper-based and citation context-based approaches, but can also serve as a basis for new ways to analyze in-text citations, as we show prototypically in this article.


Introduction

A variety of tasks use scientific paper collections to help researchers in their work. For instance, research paper recommender systems have been developed (Beel et al. 2016). Related are systems that operate on a more fine-grained level within the full text, such as the textual contexts in which citations appear (i.e., citation contexts). Based on citation contexts, properties such as the citation function (Teufel et al. 2006a, b; Moravcsik and Murugesan 1975), the citation polarity (Ghosh et al. 2016; Abu-Jbara et al. 2013), and the citation importance (Valenzuela et al. 2015; Chakraborty and Narayanam 2016) can be determined. Furthermore, citation contexts are necessary for context-aware citation recommendation (He et al. 2010; Ebesu and Fang 2017), as well as for citation-based document summarization tasks (Chandrasekaran et al. 2019), such as citation-based automated survey generation (Mohammad et al. 2009) and automated related work section generation (Chen and Zhuge 2019).

The evaluation of approaches developed for all these tasks, as well as the actual applicability and usefulness of developed systems in real-world scenarios, heavily depend on the data set used. Such a data set is typically a collection of papers provided in full text, or a set of already extracted citation contexts consisting of, for instance, 1–3 sentences each. Existing data sets, however, do not fulfill all of the following criteria (see section “Existing data sets” for more details):

Size. The data set may be comparatively small (below 100,000 documents), which makes it difficult to use for training and testing machine learning approaches;

Cleanliness. The papers’ full texts or citation contexts are often very noisy due to the conversion from PDF to plain text and due to encoding issues;

Global citation annotations. No links from the citations in the text to the structured representations of the cited publications across documents are provided;

Data set interlinkage. Data sets often do not provide identifiers of the citing and cited documents from widely used bibliographic databases, such as DBLP Footnote 1 or the Microsoft Academic Graph Footnote 2 (MAG);

Cross-domain coverage. Often, only a single scientific discipline is available for evaluating or applying an approach to a paper or citation-based task.

In this paper we propose a new scholarly data set, which we call unarXive. Footnote 3 The data set is built for tasks based on papers’ full texts, in-text citations, and metadata. It is freely available at http://doi.org/10.5281/zenodo.3385851 and the implementation for creating it at https://github.com/IllDepence/unarXive.

Considering the application of our data set, we argue that it not only can be used as a new large data set for evaluating paper-based and citation-based approaches with unlimited citation context lengths (since the publications’ full texts are available), but can also serve as a basis for novel ways of paper analytics within bibliometrics and scientometrics. For instance, based on the citation contexts and the citing and cited papers’ metadata in the MAG, analyses of biases in the writing and citing behavior of researchers, e.g., related to authors’ affiliation (Reingewertz and Lutmar 2018) or documents’ language (Liang et al. 2013; Liu et al. 2018), can be performed. Furthermore, (sophisticated) deep learning approaches, which have recently become widely used in the digital library domain (Ebesu and Fang 2017), require huge amounts of training data. Our data set allows us to overcome this hurdle and investigate how far deep learning approaches can lead us. Overall, we argue that our data set brings the state of the art of big scholarly data a significant step forward.

We make the following contributions in this paper:

We propose a large, interlinked scholarly data set with papers’ full texts, annotated in-text citations, and links to rich metadata. We describe its creation process in detail and provide both the data as well as the creation process implementation to the public.

We manually evaluate the validity of our reference links on a sample of 300 references, thereby providing insight into our citation network’s quality.

We calculate statistical key figures and analyze the data set with respect to its contained references and citations.

We compare our reference links to those in the MAG, and manually evaluate the validity of links only appearing in either of the data sets. In doing so, we identify a large number of documents where the MAG lacks coverage.

We analyze the likelihood with which in-text citations in our data set refer to specific parts of a cited document, depending on the discipline of the citing and cited document. Such an analysis is only possible with word-level-precision citation marker positions annotated in the full text, together with metadata on both citing and cited documents. The analysis therefore showcases the practicability of our data set.

The paper is structured as follows: After outlining related data sets in section “Existing data sets”, we describe in section “Data set creation” how we created our data set. This is followed by statistics and key figures in section “Statistics and key figures”. In section “Evaluation of citation data validity and coverage”, we evaluate the validity and coverage of our reference links. Section “Analysis of citation flow and citation contexts” is dedicated to the analysis of the citation flow and the contexts within our data set. We conclude in section “Conclusion” with a summary and an outlook.

Existing data sets

Table 2 gives an overview of related data sets. CiteSeerX can be regarded as the most frequently used evaluation data set for citation-based tasks. For our investigation, we use the snapshot of the entire CiteSeerX data set as of October 2013, published by Huang et al. (2015). This data set consists of 1,017,457 papers, together with 10,760,318 automatically extracted citation contexts. It has the following drawbacks (Roy et al. 2016; Färber et al. 2018): the provided meta-information about cited publications is often not accurate; citing and cited documents are not interlinked with other data sets; and the citation contexts can contain noise from non-ASCII characters, formulas, section titles, missed references and/or other “unrelated” references, and may not begin with a complete word.

The PubMed Central Open Access Subset is another large data set that has been used for citation-based tasks (Gipp et al. 2015; Duma et al. 2016; Galke et al. 2018). Contained publications are already processed and available in the JATS (Huh 2014) XML format. While the data set overall is comparatively clean, heterogeneous annotation of citations within the text and mixed usage of identifiers for cited documents (PubMed, MEDLINE, DOI, etc.) make it difficult to retrieve high-quality citation interlinking of documents from the data set Footnote 4 (Gipp et al. 2015).

Besides the data sets mentioned above, there are other collections of scientific publications, among them the ACL Anthology corpus (Bird et al. 2008) and Scholarly Dataset 2 (Sugiyama and Kan 2015). Note that these data sets only contain the publications themselves, typically in PDF format. Using such data sets for paper-based or citation-based approaches is therefore troublesome, since one must first preprocess the data, i.e., (1) extract the content without introducing too much noise, (2) specify global identifiers for cited papers, and (3) annotate citations with those identifiers. Furthermore, there are data sets for evaluating paper recommendation tasks, such as CiteULike Footnote 5 or Mendeley. Footnote 6 These, however, only provide metadata about publications or are not freely available for research purposes.

Prior to the data set described in this paper, we published an initial data set with annotated arXiv papers’ content (Färber et al. 2018). Our new data set is superior to this initial version in the following regards:

The new data set is considerably larger (1M instead of 90k documents).

The new data set provides a similar level of cleanliness to the old data set regarding the papers’ full texts and citation contexts.

A new method for resolving references to consistent global identifiers has been developed. Contrary to the old method, the new method has been evaluated and performs very well (see section “ Citation data validity ”).

While the old data set links documents solely to DBLP, which covers computer science papers, the new data set links documents to the Microsoft Academic Graph, which covers all scientific disciplines and which has been used frequently in the digital library domain in recent years (Mohapatra et al. 2019 ).

While the old data set is restricted to computer science, the new data set covers all domains of arXiv (see section “ Statistics and key figures ” and Fig.  7 ).

Lastly, compared to the initial publication of our new data set (Saier and Färber 2019), this journal article provides significantly more details and insights into the data set’s creation process (see section “Data set creation”) and its resulting characteristics (see sections “Evaluation of citation data validity and coverage” and “Analysis of citation flow and citation contexts”). Moreover, the data set has been further improved. Most notably, while in the initial version only citing papers were associated with arXiv identifiers and only cited papers had been linked to the MAG, we now provide both types of IDs for both sides. This means that for nearly all documents, MAG metadata is easily accessible, and full text is available not only for all citing papers but now also for over a quarter of the cited papers.

Data set creation

Used data sources.

The following two resources are the basis of the data set creation process.

Microsoft Academic Graph is a very large, automatically generated data set covering 213 million publications, related entities (authors, venues, etc.), and their interconnections through 1.4 billion references. Footnote 8 It has been widely used as a repository of all publications in academia in the fields of bibliometrics and scientometrics (Mohapatra et al. 2019). While pre-extracted citing sentences are available, these do not contain annotated citation marker positions. Full text documents are also not available. The size of the MAG makes it a good target for matching reference strings Footnote 9 against, especially given that arXiv spans several disciplines.

Pipeline overview

figure 1

Schematic representation of the data set generation process

To create the data set, we start out with arXiv sources (see Fig. 1). From these we generate, per publication, a plain text file with the document’s textual contents and a set of database entries reflecting the document’s reference section. The association between reference strings and in-text citation locations is preserved by placing citation markers in the text. In a second step, we iterate through all reference strings in the database and match them against paper metadata records in the MAG. This gives us full text arXiv papers with (word level precision) citation links to MAG paper IDs. As a final step, we enrich the data with MAG IDs on the citing paper side (in addition to the already present arXiv IDs) and arXiv IDs on the cited paper side (in addition to the already present MAG IDs). This is a straightforward process, because the paper metadata in the MAG includes source URLs: papers found on arXiv have an arXiv.org source URL associated with them, such that a mapping from arXiv IDs to MAG IDs can be created.
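The first pipeline step, plain text generation with marker placement, can be sketched as follows. The marker syntax, bibliography, and function names are our own illustration under stated assumptions, not the released implementation:

```python
import re

# Minimal sketch of the first pipeline step described above: replace LaTeX
# \cite commands with inline markers while recording which reference string
# each marker points to. Marker syntax and bibliography are invented for
# illustration; the released unarXive implementation differs.
bibliography = {"smith17": "J. Smith. A cited paper. 2017."}

def latex_to_plaintext(latex: str):
    markers = {}
    def repl(match):
        marker = "{{cite:%d}}" % len(markers)   # sequential marker per citation
        markers[marker] = bibliography[match.group(1)]
        return marker
    text = re.sub(r"\\cite\{(\w+)\}", repl, latex)
    return text, markers

text, markers = latex_to_plaintext(r"Prior work \cite{smith17} showed this.")
print(text)  # Prior work {{cite:0}} showed this.
```

The `markers` mapping is what later allows each in-text citation location to be tied back to its reference string once that string is resolved against the MAG.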

Listing 2 shows what our data set looks like. In the following, we describe the main steps of the data set creation process in more detail.

Resulting approach

Reference resolution.

Resolving references to globally consistent identifiers (e.g., detecting that the reference strings (1), (2), and (3) in Listing 1 all reference the same document) is a challenging and still unsolved task (Nasar et al. 2018). Since the title is the most distinctive single part of a publication, we base our reference resolution on the title of the cited work and use other pieces of information (e.g., the authors’ names) only in secondary steps. In the following, we describe the challenges we faced when matching arXiv documents’ reference strings against MAG paper records, and how we approached the task.

Reference resolution can be challenging when reference strings contain only minimal amounts of information, when formulas or other special notation is used in titles, or when they refer to non-publications (e.g., Listing 1, (4)–(6)). Another problem we encountered was noise in the MAG. One such case is the pair of MAG papers with IDs 2167727518 and 2763160969. Both are identically titled “Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC” and dated to the year 2012. But while the former is cited 17k times and cites 112 papers within the MAG, the latter is neither cited by nor cites any other papers. Footnote 14 Taking the number of citations into account when matching references reduced the number of mismatches in this particular case from 2,918 to 0 and improved the overall quality of matches in general.

figure a

Examples of reference strings

Our reference resolution procedure can be broken down into two steps: title identification and matching. If the reference string contains an arXiv ID or DOI, title identification is performed based on that identifier (we retrieve the title from an arXiv metadata dump or via crossref.org Footnote 15); otherwise we use Neural ParsCit (Prasad et al. 2018). Footnote 16 The identified title is then matched against the normalized titles of all publications in the MAG. Resulting candidates are considered if at least one of the authors’ names (as given in the MAG) is present in the reference string. If multiple candidates remain, we judge by the citation count given in the MAG; this particularly helps mitigate matches to rogue almost-duplicate entries in the MAG, which often have few to no citations, like paper 2763160969 mentioned above.
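A minimal re-implementation of this matching strategy (normalized-title match, author check, citation-count tiebreak) might look as follows. The MAG records, field layout, and `resolve` helper are invented stand-ins; only the strategy itself comes from the text:

```python
# Illustrative sketch of the matching step described above; the two records
# mirror the near-duplicate MAG example from the text, but all field names
# and the resolve() helper are our own invention, not the released code.
def normalize(title: str) -> str:
    return "".join(c for c in title.lower() if c.isalnum() or c == " ").strip()

MAG_RECORDS = [
    {"id": 2167727518,
     "title": "Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC",
     "authors": ["Chatrchyan"], "citations": 17000},
    {"id": 2763160969,
     "title": "Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC",
     "authors": ["Chatrchyan"], "citations": 0},
]

def resolve(identified_title: str, reference_string: str):
    candidates = [r for r in MAG_RECORDS
                  if normalize(r["title"]) == normalize(identified_title)]
    candidates = [r for r in candidates
                  if any(a.lower() in reference_string.lower() for a in r["authors"])]
    if not candidates:
        return None
    # prefer the well-connected record over rogue near-duplicates
    return max(candidates, key=lambda r: r["citations"])["id"]

ref = "S. Chatrchyan et al. Observation of a new boson at a mass of 125 GeV ... 2012."
print(resolve("Observation of a new boson at a mass of 125 GeV with the CMS experiment at the LHC", ref))
# 2167727518
```

With the citation-count tiebreak, the well-connected record wins over the orphaned duplicate, mirroring the mismatch reduction reported in the text.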

Result format

Listing 2 shows some example content from the data set. In addition to the paper plain text files and the references database, we also provide the citation contexts of all successfully resolved references extracted to a CSV file, as well as a script to create custom exports. Footnote 17 For the provided CSV export, we set the citation context length to 3 sentences (the sentence containing the citation as well as the one before and after), as used by Tang et al. (2014) and Huang et al. (2015). Each line in an export CSV has the following columns: cited MAG ID, adjacent cited MAG IDs, citing MAG ID, cited arXiv ID, adjacent cited arXiv IDs, citing arXiv ID, text (see bottom of Listing 2). Citations are deemed adjacent if they are part of a citation group or are at most 5 characters apart (e.g., “[27,42]”, “[27], [42]” or “[27] and [42]”). The IDs of adjacent cited documents are added because those documents are cited in an almost identical context (i.e., only a few characters to the left or right).
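The 5-character adjacency rule can be sketched directly. The bracket marker format below is our illustration, not the data set's internal marker syntax:

```python
import re

# Sketch of the adjacency rule described above: two citation markers count
# as adjacent if at most 5 characters separate them, which covers groups
# such as "[27], [42]" and "[27] and [42]". Marker format is illustrative.
def adjacent_pairs(text: str):
    markers = list(re.finditer(r"\[(\d+)\]", text))
    pairs = []
    for m1, m2 in zip(markers, markers[1:]):
        if m2.start() - m1.end() <= 5:  # gap of up to 5 characters
            pairs.append((m1.group(1), m2.group(1)))
    return pairs

print(adjacent_pairs("as shown in [27], [42] and earlier in [7]."))  # [('27', '42')]
```

Here "[27]" and "[42]" are separated by 2 characters and count as adjacent, while "[7]" is 16 characters away from "[42]" and does not.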

figure b

Excerpts from (top to bottom) a paper’s plain text, corresponding entries in the references database, entries in the MAG, and extracted citation context CSV

Statistics and key figures

In this section we present the data set and its creation process in terms of numbers. Furthermore, insight into the distribution of references and citation contexts is given.

Creation process

We used an arXiv source dump containing all documents up until the end of 2018 (1,492,923 documents). 114,827 of these were only available in PDF format, leaving 1,378,096 sources. Our pipeline produced 1,283,584 (93.1%) plain text files, 1,139,790 (82.7%) of which contained citation markers. The number of reference strings identified is 39,694,083, for which 63,633,427 citation markers were placed within the plain text files. This first part of the process took 67 h to run, unparallelized, on an 8-core Intel Core i7-7700 3.60 GHz machine with 64 GB of memory.

Of the 39,694,083 reference strings, we were able to match 16,926,159 (42.64%) to MAG paper records. For 31.32% of the reference strings, we could find neither an arXiv ID nor a DOI, and Neural ParsCit was unable to identify a title. Footnote 18 For the remaining 26.04%, a title was identified but could not be matched to the MAG. Of the matched 16.9 million items’ titles, 52.60% were identified via Neural ParsCit, 28.31% by DOI and 19.09% by arXiv ID. Of the identified DOIs, 32.9% were found as is, while 67.1% were heuristically determined. This was possible because the DOIs of articles in journals of the American Physical Society follow predictable patterns. The matching process took 119 h, run in 10 parallel processes on a 64-core Intel Xeon Gold 6130 2.10 GHz machine with 500 GB of memory.

Comparing the performance of our approach on all papers (1991–2018) to only the papers from 2018 (i.e., recent content), we note that the percentage of successfully extracted plain texts goes up from 93.1 to 95.9% (82.7 to 87.8% counting only plain text files containing citation markers) and the percentage of successfully resolved references increases from 42.64 to 59.39%. A possible explanation for the latter is that metadata coverage (MAG, crossref.org, etc.) is broader and of higher quality for more recent publications.

figure 2

Number of citing documents per cited document

figure 3

Number of citation contexts per reference

Resulting data set

Our data set consists of 2,746,288 cited papers, 1,043,126 citing papers, 15,954,664 references and 29,203,190 citation contexts. Footnote 19

figure 4

Visualization of the citation flow in terms of documents and references from arXiv to the MAG

Figure  2 shows the number of citing documents for all cited documents. There is one cited document with over 10,000 citing documents, another 8 with more than 5,000 and another 14 with more than 3,000. 1,485,074 (54.07%) of the cited documents are cited at least two times, 646,509 (23.54%) at least five times. The mean number of citing documents per cited document is 5.81 (SD 28.51). Figure  3 shows the number of citation contexts per entry in a document’s reference section. 10,537,235 (66.04%) entries have only one citation context, the maximum is 278, the mean 1.83 (SD 2.00).

Because not all documents referenced by arXiv papers are hosted on arXiv itself, we additionally visualize the citation flow with respect to the MAG in Fig. 4. 95% of our citing documents are contained in the MAG. Of the cited documents, 26% are contained in arXiv and therefore included as full text, while 74% are only included as MAG IDs. On the level of references, this distribution shifts to 43/57. The high percentage of citation links contained within the data set is explained by the fact that in physics and mathematics, which make up a large part of the data set, it is common to self-archive papers on arXiv.

Evaluation of citation data validity and coverage

Citation data validity.

To evaluate the validity of our reference resolution results, we take a random sample of 300 matched reference strings and manually check for each of them whether the correct record in the MAG was identified. This is done by viewing the reference string next to the matched MAG record and verifying that the former actually refers to the latter. Footnote 20 Among the 300 items, we observed 3 errors, giving us an accuracy estimate of 96% at the worst, as shown in Table 4. Table 5 shows the three incorrectly identified documents. In all three cases the misidentified document’s title is contained in the correct document’s title, and there is a large or complete author overlap between the correct and the actual match. This shows that authors sometimes title follow-up work very similarly, which leads to hard-to-distinguish cases.
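One standard way to turn "3 errors in a sample of 300" into a worst-case accuracy figure is a binomial confidence lower bound. The article does not state which interval method underlies its figure, so the Wilson score lower bound below is an assumption shown purely for illustration:

```python
from math import sqrt

# Hedged sketch: Wilson score lower bound at 95% confidence for an
# observed accuracy of 297/300. Whether the article used this interval
# (or a different one) is not stated; this is illustrative only.
def wilson_lower(successes: int, n: int, z: float = 1.96) -> float:
    p = successes / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    margin = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

print(round(wilson_lower(297, 300), 3))  # 0.971
```

Such a bound accounts for the sample size: a point estimate of 99% from only 300 items is compatible with a somewhat lower true accuracy, which is what a conservative "at the worst" figure expresses.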

Citation data coverage

For the 95% of our data set where both citing and cited document have a MAG ID, we are able to compare our citation data directly to the MAG. The composition of reference section coverage (i.e., how many of the references are reflected in each of the data sets) of all 994,351 citing documents can be seen in Fig. 5. Of the combined 26,205,834 reference links, 9,829,797 are contained in both data sets (orange), 5,918,128 are in unarXive only (blue), and 10,457,909 are in the MAG only (green). On the document level we observe that for 401,046 documents unarXive contains more references than the MAG, and for 545,048 it is the other way around. The striking difference between the reference and document level Footnote 21 suggests that the MAG has better coverage of large reference sections. This is supported by the fact that citing papers where the MAG contains more references cite on average 34.28 documents, while the same average for citing papers where unarXive contains more references is 17.46. Investigating further, in Fig. 6 we look at the number of citing documents in terms of reference section size (x-axis) and exclusive coverage in unarXive and MAG Footnote 22 (y-axis). As we can see (and as the almost exclusively blue area on the right hand side of Fig. 5 suggests), there is a large number of papers citing \(\le 50\) documents where \(\ge 80\%\) of the reference section is contained only in unarXive. Put differently, there is a large portion of documents where the reference section is covered to some degree by unarXive but has close to no coverage in the MAG. The number of citing documents where the MAG contains 0 references whereas unarXive has \(\ge 1\) is 215,291; these have an average of 15.1 references in unarXive. Footnote 23 The number of citing documents (within the 994,351 at hand) where unarXive contains 0 references whereas the MAG has \(\ge 1\) is 0.
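This coverage comparison reduces to set operations when each data set's citation links are represented as (citing MAG ID, cited MAG ID) pairs. The link pairs below are invented for illustration:

```python
# Sketch of the shared/exclusive coverage computation described above;
# the link pairs are made up, and real link sets would be built from the
# two data sets' citation records.
unarxive_links = {(1, 10), (1, 11), (2, 10)}
mag_links = {(1, 10), (2, 12), (3, 10)}

both = unarxive_links & mag_links           # contained in both data sets
only_unarxive = unarxive_links - mag_links  # unarXive-exclusive links
only_mag = mag_links - unarxive_links       # MAG-exclusive links

print(len(both), len(only_unarxive), len(only_mag))  # 1 2 2
```

Grouping the same pairs by citing ID then yields the per-document coverage composition plotted in Fig. 5.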

figure 5

Composition of reference section coverage for all citing documents (cut off at 100 cited documents)

figure 6

Distribution of citing documents in terms of reference section size and their coverage in unarXive and MAG (cut off at 750 cited documents)

Needless to say, additional references are only of value if they are valid. From both the citation links found only in unarXive and those found only in the MAG, we therefore take a sample of 150 citing-paper/cited-paper pairs and manually verify whether the former actually references the latter. This is done by inspecting the citing paper’s PDF and checking the entries in the reference section against the cited paper’s MAG record. Footnote 24 On the unarXive side, we observe 4 invalid links, all of which are cases similar to those showcased in Table 5. On the MAG side, we observe 8 invalid links. Some of them seem to originate from the same challenges as the ones we face, e.g., similarly titled publications by the same authors leading to misidentified cited papers. Other error sources are, for instance, an invalid source for a citing paper being used and its reference section parsed (e.g., paper ID 1504647293, where one of the PDF sources is the third author’s Ph.D. thesis instead of the described paper). Given that the citation links exclusive to unarXive appear to be half as noisy as those exclusive to the MAG, we argue that the 5,918,128 links found only in unarXive could be useful for citation and paper based tasks using MAG data. This would especially be the case for the field of physics, as it makes up a significant portion of our data set.

Analysis of citation flow and citation contexts

Because the documents in unarXive span multiple scientific disciplines, interdisciplinary analyses, such as the calculation of the flow of citations between disciplines, can be performed. Furthermore, the fact that documents are included as full text and that citation markers within the text are linked to their respective cited documents makes varied and fine-grained study of citation contexts possible. To give further insight into our data set, we therefore conduct several such analyses in the following. Note that, for interdisciplinary investigations, disciplines other than physics, mathematics, and computer science are combined into other for space and legibility reasons, as they are only represented by a small number of publications. On the citing documents’ side, these span the fields of economics, electrical engineering and systems science, quantitative biology, quantitative finance, and statistics. Combined on the cited documents’ side are chemistry, biology, engineering, materials science, economics, geology, psychology, medicine, business, geography, sociology, political science, philosophy, environmental science, and art.

Citation flow

figure 7

Citation flow by discipline for 15.9 million references. The number of citing and cited documents per discipline are plotted on the sides

Figure 7 depicts the flow of citations by discipline for all 15.9 million matched references. As one would expect, publications in each field are cited the most from within the field itself. Notably, the incoming citations in mathematics are the most varied (physics and computer science combined make up 35% of the citations). As citation contexts are useful descriptive surrogates of the documents they refer to (Elkiss et al. 2008), a composition as varied as that of mathematics in Fig. 7 raises the question of whether a distinction by discipline could be worth considering when using citation contexts as descriptions of cited documents. That is, computer scientists and physicists might refer to math papers in a different way than mathematicians do. Borders between disciplines are, however, not necessarily clear cut, meaning that such a distinction might not be as straightforward as the color coding in Fig. 7 suggests.

Availability of citation contexts

figure 8

Normalized distribution of the number of citation contexts per cited document

Another aspect that becomes relevant when using citation contexts to describe cited documents is the number of citation contexts available per cited publication. Figure 8 shows that the distribution of the number of citation contexts per cited document is similar across disciplines. In each discipline, around half of the cited documents are mentioned just once across all citing documents, 17.5% exactly twice, and so on. The tail of the distribution drops a bit more slowly for physics and mathematics. The mean numbers of citation contexts per cited document are 9.5 (SD 50.3) in physics, 7.0 (SD 28.8) in mathematics, 5.1 (SD 31.1) in computer science and 3.5 (SD 11.0) for the combined other fields. This leads to two conclusions. First, it suggests that a representation relying solely on citation contexts may only be viable for a small fraction of publications. Second, the high dispersion in the number of available citation contexts shows that means might not be very informative when it comes to citation counts aggregated over specific sets of documents.

Characteristics of citation contexts

For our analysis of the contents of citation contexts, we focus on three aspects: whether or not citations are (1) integral, (2) syntactic and (3) target section specific. These aspects were chosen because they give particular insights into the citing behavior of researchers, as explained alongside the following definition of terms.

“Integral”, “syntactic” and “target section specific” citations

We first discuss the terms “integral” and “syntactic”, which are both established in the existing literature. An integral citation is one where the name of the cited document’s author appears within the citing sentence and has a grammatical role (Swales 1990; Hyland 1999) (e.g., “Swales [73] has argued that ...”). Similarly, a citation is syntactic if the citation marker has a grammatical role within the citing sentence (Whidby et al. 2011; Abu-Jbara and Radev 2012) (e.g., “According to [73] it is ...”). Integral citations are seen as an indication of emphasis towards the cited author (where the opposite direction would be towards the cited work) (Swales 1990; Hyland 1999). Syntactic citations are of interest when determining how a citation relates to different parts of the citing sentence (Whidby et al. 2011; Abu-Jbara and Radev 2012). Both qualities are relevant when studying the role of citations (Färber and Sampath 2019).

Table  6 gives a more detailed account of both terms’ use in the literature. Note that Lamers et al. ( 2018 ) provide a classification algorithm for integral and non-integral citations that slightly differs from Swales’ original definition, depending on the interpretation of a citation marker’s scope, but that also gives a clear classification in an edge case where Swales’ definition is ambiguous. Furthermore, note that the two ways of distinguishing syntactic and non-syntactic citations found in the literature are not identical. This is in part because the method given by Abu-Jbara and Radev ( 2012 ) is kept rather simple. For the purposes of our analysis, we follow the definitions of Lamers et al. and Whidby et al. for “integral” and “syntactic”, respectively.

As a third aspect for analysis, we define “target section specific” citations as those in which a specific section within the citation’s target (i.e. the cited document) is referred to. Examples are given in Table  7 . Target section specific citations are of interest for two reasons. First, in a similar fashion to integral citations, they are a particular form of citing behavior that might be used to infer characteristics of the relationship between citing author and cited document (e.g. a focus on the document rather than its authors, or in-depth engagement with or familiarity with the cited document’s contents). Second, when citation contexts are used as descriptions of cited documents, such as in citation context-based document summarization, target section specific citations might benefit from special handling, as their contexts describe only a (sometimes very narrow) part of the cited document.

In the following we will analyze all three aspects (integral, syntactic, target section specific) with respect to the different scientific disciplines covered by our data set.

Manual analysis of citation contexts

For each of the disciplines computer science, mathematics, physics, and other, we take a random sample of 300 citation contexts and manually label them as integral, syntactic, and target section specific. The results of this analysis are shown in Table  8 . Each of the assigned labels is most prevalent in mathematics papers, as is the co-occurrence of the labels integral and syntactic. Mathematics is also the only discipline in which citations are more likely to be syntactic than not. The differences in the frequency of integral and syntactic citations might be due to variations in writing culture between the disciplines. We think that the comparatively high frequency of target section specific citations in mathematics could be due to the fact that, in mathematics, intermediate results like corollaries and lemmata are immediately reusable in related work. We investigate target section specific citations further in the following section.

Automated analysis of target section specific citations

Sentences including a target section specific citation often follow distinct and predictable patterns. For example, a capitalized noun (e.g. “Corollary” , “Lemma” , “Theorem” ) is followed by a number and a preposition (e.g. “in” , “of” ), and then by the citation marker (e.g. “Corollary 3 in [73]” ). Another pattern is the citation marker followed by a capitalized noun and a number (e.g. “[73] Lemma 7” ). This lexical regularity allows us to identify target section specific citations automatically. Specifically, we search the entirety of our 29 M citation contexts for word sequences that match either of the part-of-speech tag patterns \(\texttt {NNP~CD~IN~{<}citation marker{>}}\) and \(\texttt {{<}citation marker{>}~NNP~CD}\) . Doing this, we find 365,299 matches (1.25% of all contexts). This is less than the 2.31% one would expect based on the manual analysis Footnote 25 and suggests that the above two patterns are not exhaustive. Nevertheless, we can use the identified contexts to further analyze the distribution of such citations across disciplines.
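The pattern search over tagged contexts can be sketched as follows. The function, the special CIT tag for citation markers, and the hand-tagged examples are our own illustration of the two patterns given above, not the authors' implementation.

```python
import re

def is_target_section_specific(tagged_tokens, marker_tag="CIT"):
    """Check a POS-tagged citation context for the two patterns
    'NNP CD IN <citation marker>' and '<citation marker> NNP CD'.
    tagged_tokens is a list of (token, tag) pairs in which citation
    markers have been mapped to the special tag given by marker_tag."""
    tag_seq = " ".join(tag for _token, tag in tagged_tokens)
    pattern = rf"NNP CD IN {marker_tag}\b|\b{marker_tag} NNP CD"
    return re.search(pattern, tag_seq) is not None

# Hand-tagged examples; in practice a POS tagger (e.g. NLTK's pos_tag)
# would produce the tags after the citation marker has been replaced.
positive = [("Corollary", "NNP"), ("3", "CD"), ("in", "IN"), ("[73]", "CIT")]
negative = [("According", "VBG"), ("to", "TO"), ("[73]", "CIT")]
```

Matching on the tag sequence rather than the raw tokens is what makes the search robust to the many different nouns ("Corollary", "Lemma", "Theorem", "Section", ...) that can instantiate the pattern.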

Table  9 shows the results of this subsequent analysis. Because our data set does not contain equal numbers of citations from each discipline (cf. Fig.  7 ), we normalize the absolute numbers of pattern occurrences. Rows are then sorted by normalized ratio in decreasing order. Looking at the citing documents (those in which the pattern was found), we see a picture similar to that of our manual analysis (shown in Table  8 ): mathematics has by far the highest count of target section specific citations, while computer science and physics show similar counts, with physics slightly lower. Counting by cited documents (those in which a specific part is being referenced), the differences decrease slightly, but mathematics still occurs most frequently by far.
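The normalization and sorting step described above amounts to a per-discipline division; a minimal sketch (function name and dummy numbers are our own, not values from Table 9):

```python
def normalized_pattern_ratios(pattern_counts, citation_totals):
    """Divide each discipline's absolute number of pattern matches by its
    total number of citations, and return the disciplines sorted by this
    normalized ratio in decreasing order."""
    ratios = {d: pattern_counts[d] / citation_totals[d] for d in pattern_counts}
    return sorted(ratios.items(), key=lambda item: item[1], reverse=True)
```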

An interesting pattern emerges when we take an even more detailed look and break these citations down by the disciplines on both sides of the citation relation. We can then observe the following.

The strongest determining factor for target section specific citations seems to be that a mathematician is writing the document. \(^{\dagger }\) As with integral and syntactic citations, the writing culture of the field might play a role here.

The second strongest factor appears to be that a mathematical paper is being cited. \(^{\ddagger }\) Mathematics documents might lend themselves to being cited in this way.

The third strongest factor is an intra-discipline citation (i.e. the citing document is from the same discipline as the cited one). This supports the interpretation of target section specific citations as a sign of familiarity with what is being cited (cf. the section on “integral”, “syntactic” and “target section specific” citations).

\(\hbox {Math}\rightarrow \hbox {Math}\) pairs, in which all three of the above factors come into play simultaneously, consequently show by far the highest occurrence of target section specific citations.

To summarize the results of our analysis of citation flow and citation contexts, we note the following points.

Publications in mathematics are cited from “outside the field” (e.g. by computer science or physics papers) to a comparatively high degree. Distinguishing citation contexts referring to mathematics publications by discipline might therefore be beneficial in certain applications (e.g. citation-based automated survey generation).

For most publications, only one or a few citation contexts are available.

Integral citations appear to be about twice as common in computer science as in physics, and again twice as common in mathematics as in computer science. Following Swales’ interpretation of the phenomenon, this would mean that the focus put on authors is higher in mathematics than in computer science, and higher in computer science than in physics.

In mathematics, syntactic citations seem to be more common than non-syntactic citations. This is beneficial for reference scope identification (Abu-Jbara and Radev 2012 ) and for sophisticated approaches based on citation contexts (like context-aware citation recommendation), as citation markers in syntactic citations stand in a grammatical relation to their surrounding words.

We define target section specific citations as those in which a specific section within the cited document is referred to. This type of citation is most common in mathematics (compared with computer science and physics). Through a subsequent analysis of 365k target section specific citations, we find that they are more common in intra-discipline than in inter-discipline citations. This supports our assumption that they are an indicator of familiarity with the cited document.

Our five criteria outlined in the beginning, namely size , cleanliness , global citation annotations , data set interlinkage and cross-domain coverage , ultimately made the above results possible. Without sufficient size, our results would be less informative. If our documents contained too much noise, the quality of reference resolution would have deteriorated. Global citation annotations, especially because of their word-level precision, make fine-grained lexical analyses of citation contexts, like the one in section “ Automated analysis of target section specific citations ”, possible. Without interlinking our data set with the MAG, available metadata would have been scarce. While we mainly focused on the scientific discipline information in the MAG, there is much more (authors, venues, etc.) that can be worked with in future analyses. Lastly, had our data set covered only a single scientific discipline, neither an analysis of citation flow nor interdisciplinary comparisons of citation context criteria would have been possible.

Evaluating and applying approaches to research paper-based and citation-based tasks typically requires large, high-quality, citation-annotated, interlinked data sets. In this paper, we proposed a new data set with over one million papers’ full texts, 29.2 million annotated citations, and 29.2 million extracted citation contexts (of three sentences each), ready to be used by researchers and practitioners. We provide the data set and the implementation for creating the data set from arXiv source files online for further usage.

For the future, we plan to use the data set for a variety of tasks. Among others, we will develop a citation recommendation system based on all arXiv papers. Furthermore, we plan to perform additional analyses on citations and citation contexts across scientific disciplines, and to use the differences in citing behavior for enhanced citation recommendation.

See https://dblp.uni-trier.de/ .

See https://www.microsoft.com/en-us/research/project/microsoft-academic-graph/ and http://ma-graph.org .

The name is derived from the source name arXiv and the verb to unarchive , indicating the extraction of files from an archive.

To be more precise, the heterogeneity makes using the data set as is unfeasible. Resolving references to a single consistent set of identifiers retrospectively would be an option, but is comparatively challenging in the case of PubMed because of the frequent use of special notation in publication titles; see also: http://www.sciplore.org/files/citrec/CITREC_Parser_Documentation.pdf .

Hosted at http://citeulike.org/ until March 2019.

See https://data.mendeley.com/ .

See https://arxiv.org/stats/monthly_submissions .

Numbers as of February 2019.

I.e., the entries in the reference section of a publication. See Lst. 1 for examples.

The arXiv guidelines specifically suggest omitting these (see https://arxiv.org/help/submit_tex#wegotem ).

See https://www-sop.inria.fr/marelle/tralics/packages.html#natbib .

See https://ctan.org/pkg/latexpand .

We also tested flatex ( https://ctan.org/pkg/flatex ) and flap ( https://github.com/fchauvel/flap ) but got the best results with latexpand.

The MAG record with ID 2763160969 appears to be a noisy duplicate caused by a web source with easily misinterpretable author information (only a partial list is displayed).

See https://www.crossref.org/ .

For title identification we also considered two other state-of-the-art (Tkaczyk et al. 2018 ) tools, namely CERMINE (Tkaczyk et al. 2015 ) and GROBID (Lopez 2009 ). However, we found CERMINE to be considerably slower than the other tools. And while GROBID showed comparable speed and output quality in preliminary tests, Neural ParsCit’s tag-based output format was more straightforward to integrate than the faceted TEI structures that GROBID’s reference parser module returns.

See Python script extract_contexts.py bundled with the data set for details.

To assess whether the large percentage of reference strings without an identified title is due to Neural ParsCit missing many of them, we manually check its output for a random sample of 100 papers (4027 reference strings). We find that 99% of the cases with no identified title actually do not contain a title, like for example items (1), (2) and (4) in Lst. 1. These kinds of references seem to be most common in physics papers. The 1% in which a title was missed were largely references to non-English titles and books. We therefore conclude that the observed numbers largely reflect the actual state of the reference strings rather than problems with the approach taken.

References that were successfully matched to a MAG record but have no associated citation markers (due to parsing errors; cf. section “ Challenges ”) are not counted here.

Further details can be found at https://github.com/IllDepence/unarXive/tree/master/doc/matching_evaluation .

While the number of reference links exclusive to the MAG is about twice as high as the number of reference links exclusive to unarXive, the number of documents for which either of the data sets has better coverage is on a comparable level.

Calculated as \(\frac{\#\text {citations only in unarXive } - \#\text {citations only in MAG}}{\#\text {citations in both } + \#\text {citations only in unarXive } + \#\text {citations only in MAG}}\) .
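This calculation translates directly into code; the function name and the tallied counts are our own illustration. Positive values mean unarXive found more reference links for a document than the MAG, negative values mean the opposite.

```python
def relative_citation_difference(in_both, only_unarxive, only_mag):
    """Relative difference in citation coverage per document, following
    the footnote's formula:
    (#only in unarXive - #only in MAG) / (#in both + #only unarXive + #only MAG)."""
    total = in_both + only_unarxive + only_mag
    return (only_unarxive - only_mag) / total
```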

Manually looking into a sample of 100 of these documents, we find the most salient commonality to be irregularities w.r.t. the reference section headline. 58 of the papers (55 physics, 2 quantitative biology, 1 CS) have no reference section headline, 2 have a duplicated reference section headline, and a further 2 have the headline directly followed by a page break. The reason for the large number of MAG documents with no references might therefore be that the PDF parser used cannot yet deal with such cases.

Further details can be found at https://github.com/IllDepence/unarXive/tree/master/doc/coverage_evaluation .

Because disciplines are not equally represented in the data set, the expected value is not simply the average of the values in Table  8 ( \(\frac{5+17+4+7}{4}\times 300^{-1}=0.0275\) ), but a weighted average \((5\times w_{\mathrm{cs}}+17\times w_{\mathrm{math}}+4\times w_{\mathrm{phys}}+7\times w_{\mathrm{other}})\times 300^{-1}\) , with \(\sum w_{\langle {\mathrm{discipline}}\rangle }=1\) . This gives a value of \(\approx 0.0231\) .
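The weighted expectation can be sketched as follows. The weights in the actual calculation are each discipline's share of all citation contexts (not given in the footnote); with equal weights of 0.25 the function reproduces the unweighted value 0.0275.

```python
def expected_match_ratio(label_counts, weights, sample_size=300):
    """Weighted expected share of target section specific citations: each
    discipline's count from the 300-context manual sample is weighted by
    that discipline's share of all citation contexts (weights sum to 1)."""
    return sum(label_counts[d] * weights[d] for d in label_counts) / sample_size

# Counts of target section specific citations from the manual analysis.
counts = {"cs": 5, "math": 17, "phys": 4, "other": 7}
```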

Abu-Jbara, A., & Radev, D. (2012). Reference scope identification in citing sentences. In Proceedings of the 2012 conference of the North American chapter of the association for computational linguistics: human language technologies, association for computational linguistics, Stroudsburg, PA, USA (pp. 80–90).

Abu-Jbara, A., Ezra, J., & Radev, D. (2013). Purpose and polarity of citation: Towards NLP-based bibliometrics. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: human language technologies, association for computational linguistics, Atlanta, Georgia (pp. 596–606).

Bast, H., & Korzen, C. (2017). A benchmark and evaluation for text extraction from PDF. In Proceedings of the 2017 ACM/IEEE joint conference on digital libraries , JCDL’17 (pp. 99–108).

Beel, J., Gipp, B., Langer, S., & Breitinger, C. (2016). Research-paper recommender systems: A literature survey. International Journal on Digital Libraries , 17 (4), 305–338. https://doi.org/10.1007/s00799-015-0156-0 .


Bird, S., Dale, R., Dorr, B.J., Gibson, B.R., Joseph, M.T., Kan, M., Lee, D., Powley, B., Radev, D.R., & Tan, Y.F. (2008). The ACL anthology reference corpus: A reference dataset for bibliographic research in computational linguistics. In Proceedings of the sixth international conference on language resources and evaluation , LREC’08.

Brown, L. D., Cai, T. T., & DasGupta, A. (2001). Interval estimation for a binomial proportion. Statistical Science , 16 (2), 101–133.


Caragea, C., Wu, J., Ciobanu, A.M., Williams, K., Ramírez, J.P.F., Chen, H., Wu, Z., & Giles, C.L. (2014). CiteSeerX: A scholarly big dataset. In Proceedings of the 36th European conference on IR research , ECIR’14 (pp. 311–322).

Chakraborty, T., & Narayanam, R. (2016). All fingers are not equal: Intensity of references in scientific articles. In Proceedings of the 2016 conference on empirical methods in natural language processing, EMNLP’16 (pp. 1348–1358).

Chandrasekaran, M.K., Yasunaga, M., Radev, D.R., Freitag, D., & Kan, M. (2019). Overview and results: CL-SciSumm shared task 2019. In Proceedings of the 4th joint workshop on bibliometric-enhanced information retrieval and natural language processing for digital libraries , BIRNDL’19, (pp. 153–166).

Chen, J., & Zhuge, H. (2019). Automatic generation of related work through summarizing citations. Concurrency and Computation: Practice and Experience , 31 (3), e4261.

Duma, D., Klein, E., Liakata, M., Ravenscroft, J., & Clare, A. (2016). Rhetorical classification of anchor text for citation recommendation. D-Lib Magazine , 22 , 1.

Ebesu, T., & Fang, Y. (2017). Neural citation network for context-aware citation recommendation. In Proceedings of the 40th international ACM SIGIR conference on research and development in information retrieval , SIGIR’17, (pp. 1093–1096).

Elkiss, A., Shen, S., Fader, A., Erkan, G., States, D. J., & Radev, D. R. (2008). Blind men and elephants: What do citation summaries tell us about a research article? Journal of the Association for Information Science and Technology , 59 (1), 51–62. https://doi.org/10.1002/asi.20707 .

Färber, M., & Sampath, A. (2019). Determining how citations are used in citation contexts. In Proceedings of the 23rd international conference on theory and practice of digital libraries , TPDL’19.

Färber, M., Thiemann, A., & Jatowt, A. (2018). A high-quality gold standard for citation-based tasks. In Proceedings of the 11th international conference on language resources and evaluation , LREC’18.

Galke, L., Mai, F., Vagliano, I., & Scherp, A. (2018). Multi-modal adversarial autoencoders for recommendations of citations and subject labels. In Proceedings of the 26th conference on user modeling, adaptation and personalization, ACM, New York, NY, USA, UMAP ’18 (pp. 197–205). https://doi.org/10.1145/3209219.3209236 .

Ghosh, S., Das, D., & Chakraborty, T. (2016). Determining sentiment in citation text and analyzing its impact on the proposed ranking index. In Proceedings of the 17th international conference on computational linguistics and intelligent text processing , CICLing’16 (pp. 292–306).

Gipp, B., Meuschke, N., & Lipinski, M. (2015). CITREC: An evaluation framework for citation-based similarity measures based on TREC genomics and PubMed central. In Proceedings of the iConference 2015 .

He, Q., Pei, J., Kifer, D., Mitra, P., & Giles, C.L. (2010). Context-aware citation recommendation. In Proceedings of the 19th international conference on world wide web , WWW’10, (pp. 421–430).

Huang, W., Wu, Z., Liang, C., Mitra, P., & Giles, C.L. (2015). A neural probabilistic model for context based citation recommendation. In Proceedings of the twenty-ninth AAAI conference on artificial intelligence, AAAI Press, AAAI’15 (pp. 2404–2410).

Huh, S. (2014). Journal article tag suite 1.0: National information standards organization standard of journal extensible markup language. Science Editing , 1 (2), 99–104. https://doi.org/10.6087/kcse.2014.1.99 .

Hyland, K. (1999). Academic attribution: Citation and the construction of disciplinary knowledge. Applied Linguistics , 20 (3), 341–367. https://doi.org/10.1093/applin/20.3.341 .

Lamers, W., Eck, N.J.v., Waltman, L., & Hoos, H. (2018). Patterns in citation context: the case of the field of scientometrics. In STI 2018 conference proceedings, centre for science and technology studies (CWTS) (pp 1114–1122).

Liang, L., Rousseau, R., & Zhong, Z. (2013). Non-english journals and papers in physics and chemistry: Bias in citations? Scientometrics , 95 (1), 333–350. https://doi.org/10.1007/s11192-012-0828-0 .

Liu, F., Hu, G., Tang, L., & Liu, W. (2018). The penalty of containing more non-english articles. Scientometrics , 114 (1), 359–366. https://doi.org/10.1007/s11192-017-2577-6 .

Lopez, P. (2009). GROBID: Combining automatic bibliographic data recognition and term extraction for scholarship publications. In Research and advanced technology for digital libraries (pp. 473–474). Berlin: Springer.

Mohammad, S., Dorr, B.J., Egan, M., Awadallah, A.H., Muthukrishnan, P., Qazvinian, V., Radev, D.R., Zajic, D.M. (2009). Using citations to generate surveys of scientific paradigms. In Proceedings of the 2009 annual conference of the North American chapter of the association for computational linguistics , NAACL-HLT’09, (pp. 584–592).

Mohapatra, D., Maiti, A., Bhatia, S., & Chakraborty, T. (2019). Go wide, go deep: Quantifying the impact of scientific papers through influence dispersion trees. In Proceedings of the 19th ACM/IEEE joint conference on digital libraries , JCDL’19 (pp. 305–314).

Moravcsik, M. J., & Murugesan, P. (1975). Some results on the function and quality of citations. Social Studies of Science , 5 (1), 86–92.

Nasar, Z., Jaffry, S. W., & Malik, M. K. (2018). Information extraction from scientific articles: A survey. Scientometrics , 117 (3), 1931–1990. https://doi.org/10.1007/s11192-018-2921-5 .

Prasad, A., Kaur, M., & Kan, M. Y. (2018). Neural ParsCit: A deep learning based reference string parser. International Journal on Digital Libraries , 19 , 323–337.

Radev, D. R., Muthukrishnan, P., Qazvinian, V., & Abu-Jbara, A. (2013). The ACL anthology network corpus. Language Resources and Evaluation , 47 (4), 919–944.

Reingewertz, Y., & Lutmar, C. (2018). Academic in-group bias: An empirical examination of the link between author and journal affiliation. Journal of Informetrics , 12 (1), 74–86. https://doi.org/10.1016/j.joi.2017.11.006 .

Roy, D., Ray, K., & Mitra, M. (2016). From a scholarly big dataset to a test collection for bibliographic citation recommendation. AAAI Workshops . https://www.aaai.org/ocs/index.php/WS/AAAIW16/paper/view/12635 .

Saier, T., & Färber, M. (2019). Bibliometric-enhanced arXiv: A data set for paper-based and citation-based tasks. In Proceedings of the 8th international workshop on bibliometric-enhanced information retrieval (BIR 2019) co-located with the 41st European conference on information retrieval (ECIR 2019), Cologne, Germany, April 14, 2019 , (pp. 14–26).

Sinha, A., Shen, Z., Song, Y., Ma, H., Eide, D., Hsu, B.P., & Wang, K. (2015). An overview of microsoft academic service (MAS) and applications. In Proceedings of the 24th international conference on world wide web , WWW’15, (pp. 243–246).

Sugiyama, K., & Kan, M. (2015). A comprehensive evaluation of scholarly paper recommendation using potential citation papers. International Journal on Digital Libraries , 16 (2), 91–109.

Swales, J. (1990). Genre analysis: English in academic and research settings . Cambridge: Cambridge University Press.


Tang, X., Wan, X., & Zhang, X. (2014). Cross-language context-aware citation recommendation in scientific articles. In Proceedings of the 37th international ACM SIGIR conference on research and development in information retrieval, ACM, SIGIR ’14 (pp. 817–826). https://doi.org/10.1145/2600428.2609564 .

Teufel, S., Siddharthan, A., & Tidhar, D. (2006a) An annotation scheme for citation function. In Proceedings of the 7th SIGdial workshop on discourse and dialogue, association for computational linguistics, SigDIAL ’06 (pp. 80–87).

Teufel, S., Siddharthan, A., & Tidhar, D. (2006b) Automatic classification of citation function. In Proceedings of the 2006 conference on empirical methods in natural language processing , EMNLP’06, (pp. 103–110).

Tkaczyk, D., Szostek, P., Fedoryszak, M., Dendek, P. J., & Bolikowski, L. (2015). CERMINE: Automatic extraction of structured metadata from scientific literature. International Journal on Document Analysis and Recognition (IJDAR) , 18 (4), 317–335.

Tkaczyk, D., Collins, A., Sheridan, P., & Beel, J. (2018). Machine learning vs. rules and out-of-the-box vs. retrained: An evaluation of open-source bibliographic reference and citation parsers. In Proceedings of the 18th ACM/IEEE on joint conference on digital libraries, ACM, New York, NY, USA, JCDL ’18 (pp. 99–108). https://doi.org/10.1145/3197026.3197048 .

Valenzuela, M., Ha, V., & Etzioni, O. (2015). Identifying meaningful citations. AAAI Workshops . https://www.aaai.org/ocs/index.php/WS/AAAIW15/paper/view/10185 .

Whidby, M., Zajic, D., & Dorr, B. (2011). Citation handling for improved summarization of scientific documents. Tech. rep.

Acknowledgements

Open Access funding provided by Projekt DEAL.

Author information

Authors and Affiliations

Institute AIFB, Karlsruhe Institute of Technology (KIT), Karlsruhe, Germany

Tarek Saier & Michael Färber

Corresponding author

Correspondence to Tarek Saier .

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/ .

About this article

Saier, T., Färber, M. unarXive: a large scholarly data set with publications’ full-text, annotated in-text citations, and links to metadata. Scientometrics 125, 3085–3108 (2020). https://doi.org/10.1007/s11192-020-03382-z

Received : 30 September 2019

Published : 02 March 2020

Issue Date : December 2020


  • Scholarly data
  • Digital libraries


Sample Datasets

The resources here constitute test, sample, or practice data, or representative samples of live datasets, that may be used in teaching and learning statistical analysis techniques. I recommend using these when the actual topic or question of your research is secondary to learning the techniques; if that is not the case and the substance of the data matters, see our Datasets & Statistics guide or contact our Data Services Librarian.

  • CORGIS CORGIS is The Collection of Really Great, Interesting, Situated Datasets, compiled by instructors at Virginia Tech.
  • R Datasets A Github repository of datasets available through R packages (the download files will work in any statistical analysis tool). You can see the number of rows, the number of columns, and also how many columns are binary, character, numeric, etc. Includes download (.csv) and documentation links.
  • Dataset and Story Library (DASL) Sample datasets organized by a Cornell University statistics professor.
  • Tableau Sample Datasets Practice data collected by Tableau.
  • FiveThirtyEight Our Data page An archive of the data and code behind many of the website's articles and graphics.
  • FuelEconomy.gov Download fuel economy datasets by make, model, other variables.
  • Kaggle Datasets Open dataset repository.
  • Kickstarter Project Data Via our ICPSR membership. Create a free personal account using your St. Thomas email to download.
  • Market Values of College and University Endowments Values of endowment funds at U.S. colleges and universities, with classifications. From the National Association of College and University Business Officers (NACUBO). See also their Research page for related studies.
  • National Database of Childcare Prices Sponsored by the Department of Labor Women's Bureau, the site contains a worksheet of county-level childcare price data from across the U.S. dating back to 2008, along with a Technical Guide (codebook) and associated research.
  • Office of the State Auditor: Municipal Liquor Store Operations Data Annual data on the financial operations of city-owned liquor stores across Minnesota; this site includes raw data files and narrative reports. Includes both quantitative and categorical variables.
  • Sample Sales Data This link is simply a Google search for "sample sales data by store", which will give you a number of possibilities.
  • Tableau Public Sample Datasets Tableau Public is a platform for sharing data visualizations made in their desktop software. Sample data for learning. Free personal account needed. If you have the desktop version you can download the underlying datasets from the visualizations.
  • Last Updated: Mar 12, 2024 3:21 PM
  • URL: https://libguides.stthomas.edu/stat_courses

© 2023 University of St. Thomas, Minnesota

Data Papers & Data Journals


The rise of the "data paper"

Datasets are increasingly being recognized as scholarly products in their own right, and as such, are now being submitted for standalone publication. In many cases, the greatest value of a dataset lies in sharing it, not necessarily in providing interpretation or analysis. For example, this paper presents a global database of the abundance, biomass, and nitrogen fixation rates of marine diazotrophs. This benchmark dataset, which will continue to evolve over time, is a valuable standalone research product. Under traditional publication models, this dataset would not be considered "publishable" because it doesn't present novel research or interpretation of results. Data papers facilitate the sharing of data in a standardized framework that provides value, impact, and recognition for authors. Data papers also provide much more thorough context and description than datasets that are simply deposited to a repository (which may have very minimal metadata requirements).

What is a data paper?

Data papers thoroughly describe datasets and do not usually include interpretation or discussion (though they may, for example, discuss different methods of collecting the data). Some data papers are published in a distinct “Data Papers” section of a well-established journal (see this article in Ecology, for example). It is becoming more common, however, to see journals that focus exclusively on publishing datasets. The purpose of a data journal is to provide quick access to high-quality datasets that are of broad interest to the scientific community. Data papers are intended to facilitate reuse of the dataset, which increases its value and impact and speeds the pace of research by avoiding unintentional duplication of effort.

Are data papers peer-reviewed?

Data papers typically go through a peer review process in the same manner as research articles, but because the format is still new to scientific practice, the quality and scope of the review process vary across publishers. A good example of a peer-reviewed data journal is Earth System Science Data (ESSD). Its review guidelines are well described and are not very different from the manuscript review guidelines most researchers are already familiar with.

You might wonder: what is the difference between a ‘data paper’ and a ‘regular article plus a dataset published in a public repository’? The answer isn’t always clear. Some data papers require just as much preparation as, and are of equal quality to, ‘typical’ journal articles. Others are brief, presenting only enough metadata and descriptive content to make the dataset understandable and reusable. In most cases, however, the datasets or databases presented in data papers include much more description than datasets deposited to a repository, even when those datasets were deposited to support a manuscript. Common practices and standards are still evolving; for now, data papers and data journals are the Wild West of data sharing.

Where do the data from data papers live?

Data preservation is a corollary of data papers, not their main purpose. Most data journals do not archive data in-house. Instead, they generally require that authors submit the dataset to a repository. These repositories archive the data, provide persistent access, and assign the dataset a unique identifier (DOI). Repositories do not always require that the dataset(s) be linked with a publication (data paper or ‘typical’ paper; Dryad does require one), but if you’re going to the trouble of submitting a dataset to a repository, consider exploring the option of publishing a data paper to support it.
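Because repositories assign DOIs, dataset metadata can also be retrieved in machine-readable form: the doi.org resolver supports content negotiation, where requesting a DOI with an `Accept: application/vnd.citationstyles.csl+json` header returns CSL-JSON metadata. The sketch below skips the network step and simply formats a hypothetical, hard-coded CSL-JSON record into a data citation; the record's fields (author, title, DOI, publisher) are invented for illustration.

```python
import json

# Hypothetical CSL-JSON metadata, of the kind a DOI resolver can return
# via content negotiation (Accept: application/vnd.citationstyles.csl+json).
record = json.loads("""
{
  "type": "dataset",
  "author": [{"family": "Doe", "given": "Jane"}],
  "issued": {"date-parts": [[2023]]},
  "title": "Example marine diazotroph database",
  "publisher": "Example Repository",
  "DOI": "10.9999/example.12345"
}
""")

def cite_dataset(rec):
    """Build a simple 'Author (Year). Title [Data set]. Publisher. DOI URL' string."""
    authors = "; ".join(
        f"{a['family']}, {a['given']}" for a in rec.get("author", [])
    )
    year = rec["issued"]["date-parts"][0][0]
    return (f"{authors} ({year}). {rec['title']} [Data set]. "
            f"{rec['publisher']}. https://doi.org/{rec['DOI']}")

citation = cite_dataset(record)
print(citation)
```

In a real workflow the `record` dict would come from an HTTP GET against `https://doi.org/<doi>` with the content-negotiation header above, rather than being hard-coded.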

How can I find data journals?

The article by Walters (2020) includes a list of data journals in its appendix and differentiates between "pure" data journals and journals that publish data reports but are devoted mainly to other types of contributions. It also updates earlier lists of data journals (Candela et al., 2015).

Walters, William H. 2020. “Data Journals: Incentivizing Data Access and Documentation Within the Scholarly Communication System.” Insights 33 (1): 18. DOI: http://doi.org/10.1629/uksg.510

Candela, L., Castelli, D., Manghi, P., & Tani, A. (2015). Data journals: A survey. Journal of the Association for Information Science and Technology, 66(9), 1747–1762. https://doi.org/10.1002/asi.23358

This 2014 blog post by Katherine Akers also has a long list of existing data journals.

  • Last Updated: Aug 30, 2023 9:25 AM
  • URL: https://guides.library.oregonstate.edu/research-data-services



Computer Science > Machine Learning

Title: Integrated Gradient Correlation: a Dataset-wise Attribution Method

Abstract: Attribution methods are primarily designed to study the distribution of input component contributions to individual model predictions. However, some research applications require a summary of attribution patterns across the entire dataset to facilitate the interpretability of the scrutinized models. In this paper, we present a new method called Integrated Gradient Correlation (IGC) that relates dataset-wise attributions to a model prediction score and enables region-specific analysis by a direct summation over associated components. We demonstrate our method on scalar predictions with the study of image feature representation in the brain from fMRI neural signals and the estimation of neural population receptive fields (NSD dataset), as well as on categorical predictions with the investigation of handwritten digit recognition (MNIST dataset). The resulting IGC attributions show selective patterns, revealing underlying model strategies coherent with their respective objectives.
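As a loose illustration of the building blocks involved, the sketch below computes standard integrated gradients for a toy NumPy model and then averages the per-sample attributions over a small synthetic dataset. This is only a hypothetical sketch of "dataset-wise" aggregation; the paper's actual IGC statistic (which relates attributions to a prediction score) is not reproduced here, and the model, weights, and data are invented for the example.

```python
import numpy as np

# Toy differentiable model: f(x) = sigmoid(w . x), with invented weights.
rng = np.random.default_rng(0)
w = np.array([0.5, -1.0, 2.0])

def f(x):
    return 1.0 / (1.0 + np.exp(-(x @ w)))

def grad_f(x):
    s = f(x)
    return (s * (1 - s))[..., None] * w  # chain rule for sigmoid(w . x)

def integrated_gradients(x, baseline, steps=50):
    # Average the gradient along the straight path baseline -> x, scaled by
    # (x - baseline). Satisfies completeness: the attributions sum to
    # f(x) - f(baseline) (up to Riemann-sum error).
    alphas = (np.arange(steps) + 0.5) / steps  # midpoint rule
    path = baseline + alphas[:, None] * (x - baseline)
    return (x - baseline) * grad_f(path).mean(axis=0)

X = rng.normal(size=(100, 3))          # synthetic "dataset"
baseline = np.zeros(3)
attrs = np.array([integrated_gradients(x, baseline) for x in X])

# Dataset-wise summary: mean attribution per input component.
mean_attr = attrs.mean(axis=0)
print(mean_attr.shape)  # (3,)
```

A useful sanity check on any integrated-gradients implementation is the completeness property: for each sample, the attributions should sum to the change in model output between the baseline and the input.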


Research Papers / Publications


Partisan divides over K-12 education in 8 charts

Proponents and opponents of teaching critical race theory attend a school board meeting in Yorba Linda, California, in November 2021. (Robert Gauthier/Los Angeles Times via Getty Images)

K-12 education is shaping up to be a key issue in the 2024 election cycle. Several prominent Republican leaders, including GOP presidential candidates, have sought to limit discussion of gender identity and race in schools , while the Biden administration has called for expanded protections for transgender students . The coronavirus pandemic also brought out partisan divides on many issues related to K-12 schools .

Today, the public is sharply divided along partisan lines on topics ranging from what should be taught in schools to how much influence parents should have over the curriculum. Here are eight charts that highlight partisan differences over K-12 education, based on recent surveys by Pew Research Center and external data.

Pew Research Center conducted this analysis to provide a snapshot of partisan divides in K-12 education in the run-up to the 2024 election. The analysis is based on data from various Center surveys and analyses conducted from 2021 to 2023, as well as survey data from Education Next, a research journal about education policy. Links to the methodology and questions for each survey or analysis can be found in the text of this analysis.

Most Democrats say K-12 schools are having a positive effect on the country , but a majority of Republicans say schools are having a negative effect, according to a Pew Research Center survey from October 2022. About seven-in-ten Democrats and Democratic-leaning independents (72%) said K-12 public schools were having a positive effect on the way things were going in the United States. About six-in-ten Republicans and GOP leaners (61%) said K-12 schools were having a negative effect.

A bar chart that shows a majority of Republicans said K-12 schools were having a negative effect on the U.S. in 2022.

About six-in-ten Democrats (62%) have a favorable opinion of the U.S. Department of Education , while a similar share of Republicans (65%) see it negatively, according to a March 2023 survey by the Center. Democrats and Republicans were more divided over the Department of Education than most of the other 15 federal departments and agencies the Center asked about.

A bar chart that shows wide partisan differences in views of most federal agencies, including the Department of Education.

In May 2023, after the survey was conducted, Republican lawmakers scrutinized the Department of Education’s priorities during a House Committee on Education and the Workforce hearing. The lawmakers pressed U.S. Secretary of Education Miguel Cardona on topics including transgender students’ participation in sports and how race-related concepts are taught in schools, while Democratic lawmakers focused on school shootings.

Partisan opinions of K-12 principals have become more divided. In a December 2021 Center survey, about three-quarters of Democrats (76%) expressed a great deal or fair amount of confidence in K-12 principals to act in the best interests of the public. A much smaller share of Republicans (52%) said the same. And nearly half of Republicans (47%) had not too much or no confidence at all in principals, compared with about a quarter of Democrats (24%).

A line chart showing that confidence in K-12 principals in 2021 was lower than before the pandemic — especially among Republicans.

This divide grew between April 2020 and December 2021. While confidence in K-12 principals declined significantly among people in both parties during that span, it fell by 27 percentage points among Republicans, compared with an 11-point decline among Democrats.

Democrats are much more likely than Republicans to say teachers’ unions are having a positive effect on schools. In a May 2022 survey by Education Next , 60% of Democrats said this, compared with 22% of Republicans. Meanwhile, 53% of Republicans and 17% of Democrats said that teachers’ unions were having a negative effect on schools. (In this survey, too, Democrats and Republicans include independents who lean toward each party.)

A line chart showing that from 2013 to 2022, Republicans' and Democrats' views of teachers' unions grew further apart.

The 38-point difference between Democrats and Republicans on this question was the widest since Education Next first asked it in 2013. However, the gap has exceeded 30 points in four of the last five years for which data is available.

Republican and Democratic parents differ over how much influence they think governments, school boards and others should have on what K-12 schools teach. About half of Republican parents of K-12 students (52%) said in a fall 2022 Center survey that the federal government has too much influence on what their local public schools are teaching, compared with two-in-ten Democratic parents. Republican K-12 parents were also significantly more likely than their Democratic counterparts to say their state government (41% vs. 28%) and their local school board (30% vs. 17%) have too much influence.

A bar chart showing that Republican and Democratic parents have different views of the influence government, school boards, parents and teachers have on what schools teach.

On the other hand, more than four-in-ten Republican parents (44%) said parents themselves don’t have enough influence on what their local K-12 schools teach, compared with roughly a quarter of Democratic parents (23%). A larger share of Democratic parents – about a third (35%) – said teachers don’t have enough influence on what their local schools teach, compared with a quarter of Republican parents who held this view.

Republican and Democratic parents don’t agree on what their children should learn in school about certain topics. Take slavery, for example: While about nine-in-ten parents of K-12 students overall agreed in the fall 2022 survey that their children should learn about it in school, they differed by party over the specifics. About two-thirds of Republican K-12 parents said they would prefer that their children learn that slavery is part of American history but does not affect the position of Black people in American society today. On the other hand, 70% of Democratic parents said they would prefer for their children to learn that the legacy of slavery still affects the position of Black people in American society today.

A bar chart showing that, in 2022, Republican and Democratic parents had different views of what their children should learn about certain topics in school.

Parents are also divided along partisan lines on the topics of gender identity, sex education and America’s position relative to other countries. Notably, 46% of Republican K-12 parents said their children should not learn about gender identity at all in school, compared with 28% of Democratic parents. Those shares were much larger than the shares of Republican and Democratic parents who said that their children should not learn about the other two topics in school.

Many Republican parents see a place for religion in public schools , whereas a majority of Democratic parents do not. About six-in-ten Republican parents of K-12 students (59%) said in the same survey that public school teachers should be allowed to lead students in Christian prayers, including 29% who said this should be the case even if prayers from other religions are not offered. In contrast, 63% of Democratic parents said that public school teachers should not be allowed to lead students in any type of prayers.

Bar charts that show nearly six-in-ten Republican parents, but fewer Democratic parents, said in 2022 that public school teachers should be allowed to lead students in prayer.

In June 2022, before the Center conducted the survey, the Supreme Court ruled in favor of a football coach at a public high school who had prayed with players at midfield after games. More recently, Texas lawmakers introduced several bills in the 2023 legislative session that would expand the role of religion in K-12 public schools in the state. Those proposals included a bill that would require the Ten Commandments to be displayed in every classroom, a bill that would allow schools to replace guidance counselors with chaplains, and a bill that would allow districts to mandate time during the school day for staff and students to pray and study religious materials.

Mentions of diversity, social-emotional learning and related topics in school mission statements are more common in Democratic areas than in Republican areas. K-12 mission statements from public schools in areas where the majority of residents voted Democratic in the 2020 general election are at least twice as likely as those in Republican-voting areas to include the words “diversity,” “equity” or “inclusion,” according to an April 2023 Pew Research Center analysis .

A dot plot showing that public school district mission statements in Democratic-voting areas mention some terms more than those in areas that voted Republican in 2020.

Also, about a third of mission statements in Democratic-voting areas (34%) use the word “social,” compared with a quarter of those in Republican-voting areas, and a similar gap exists for the word “emotional.” Like diversity, equity and inclusion, social-emotional learning is a contentious issue between Democrats and Republicans, even though most K-12 parents think it’s important for their children’s schools to teach these skills . Supporters argue that social-emotional learning helps address mental health needs and student well-being, but some critics consider it emotional manipulation and want it banned.

In contrast, there are broad similarities in school mission statements outside of these hot-button topics. Similar shares of mission statements in Democratic and Republican areas mention students’ future readiness, parent and community involvement, and providing a safe and healthy educational environment for students.

  • Education & Politics
  • Partisanship & Issues
  • Politics & Policy

Jenn Hatfield is a writer/editor at Pew Research Center





Latest science news, discoveries and analysis


The Maldives is racing to create new land. Why are so many people concerned?


Mini-colon and brain 'organoids' shed light on cancer and other diseases


Retractions are part of science, but misconduct isn’t — lessons from a superconductivity lab


Monkeypox virus: dangerous strain gains ability to spread through sex, new data suggest

DNA from ancient graves reveals the culture of a mysterious nomadic people

Atomic clock keeps ultra-precise time aboard a rocking naval ship

WHO redefines airborne transmission: what does that mean for future pandemics?

Ecologists: don’t lose touch with the joy of fieldwork (Chris Mantegna)

European ruling linking climate change to human rights could be a game changer: here’s how (Charlotte E. Blattner)


Lethal AI weapons are here: how can we control them?


Living on Mars would probably suck — here's why


Dozens of genes are linked to post-traumatic stress disorder


What toilets can reveal about COVID, cancer and other health threats

How gliding marsupials got their ‘wings’

Plastic pollution: three numbers that support a crackdown

First glowing animals lit up the oceans half a billion years ago

How to freeze a memory: putting worms on ice stops them forgetting


Any plan to make smoking obsolete is the right step


Will AI accelerate or delay the race to net-zero emissions?


Citizenship privilege harms science

We must protect the global plastics treaty from corporate interference (Martin Wagner)

UN plastics treaty: don’t let lobbyists drown out researchers

Current issue

Issue Cover

Surprise hybrid origins of a butterfly species

Stripped-envelope supernova light curves argue for central engine activity

Optical clocks at sea

Research analysis


A chemical method for selective labelling of the key amino acid tryptophan


Charles Darwin investigates: the curious case of primrose punishment


Nanoparticle fix opens up tricky technique to forensic applications


Coupled neural activity controls working memory in humans

Robust optical clocks promise stable timing in a portable package

Targeting RNA opens therapeutic avenues for Timothy syndrome

Bioengineered ‘mini-colons’ shed light on cancer progression

Ancient DNA traces family lines and political shifts in the Avar empire


Breaking ice, and helicopter drops: winning photos of working scientists


Shrouded in secrecy: how science is harmed by the bullying and harassment rumour mill


Londoners see what a scientist looks like up close in 50 photographs

How ground glass might save crops from drought on a Caribbean island

Deadly diseases and inflatable suits: how I found my niche in virology research

Books & culture


How volcanoes shaped our planet — and why we need to be ready for the next big eruption


Dogwhistles, drilling and the roots of Western civilization: Books in brief


Cosmic rentals

Las Boriqueñas remembers the forgotten Puerto Rican women who tested the first pill

Dad always mows on summer Saturday mornings


COMMENTS

  1. Dataset Search

    Dataset Search. Try coronavirus covid-19 or water quality site:canada.ca. Learn more about Dataset Search.

  2. Machine Learning Datasets

    Stay informed on the latest trending ML papers with code, research developments, libraries, methods, and datasets. ... National Institute of Standards and Technology database) is a large collection of handwritten digits. It has a training set of 60,000 examples, and a test set of 10,000 examples. ... and void). The dataset consists of around ...

  3. Data Sets for Quantitative Research: Public Use Datasets

    Includes databases, data files, CD-ROM, etc. available for purchase. Harvard DataVerse; r3Data.org Registry of Research Data Repositories; Open Data: European Commission Launches European Data Portal (over 1 million datasets From 36 countries) Awesome Public Datasets (on github)*. Includes a mix of free and pay resources.

  4. Datasets

    ScreenQA Short. The dataset is a modification of the original ScreenQA dataset. It contains the same ~86K questions for ~35K screenshots from Rico, but the ground truth is a list of short answers. It should be used to train and evaluate models capable of screen content understanding via question answering.

  5. Scientific Data

    Scientific Data is an open access journal dedicated to data, publishing descriptions of research datasets and articles on research data sharing from all areas ...

  6. Datasets

    Datasets. A dataset (also spelled 'data set') is a collection of raw statistics and information generated by a research study. Datasets produced by government agencies or non-profit organizations can usually be downloaded free of charge. However, datasets developed by for-profit companies may be available for a fee.

  7. Everything you always wanted to know about a dataset: Studies in data

    In this paper we consider the latter. Our research, as much of the related work in human data interaction, is based on the assumption that, ... The data set records whether they are alive or dead characters, their gender, their characteristics (like: hair and eye colour). The data set records if the character has a secret identity [.] (and ...

  8. arxiv_dataset · Datasets at Hugging Face

    Data Instances. This dataset is a mirror of the original ArXiv data. Because the full dataset is rather large (1.1TB and growing), this dataset provides only a metadata file in the json format. An example is given below. {'id': '0704.0002', 'submitter': 'Louis Theran', 'authors': 'Ileana Streinu and Louis Theran', 'title': 'Sparsity-certifying ...

  9. Google Dataset Search: Building a search engine for ...

    In this paper, we discuss Google Dataset Search, a dataset-discovery tool that provides search capabilities over potentially all datasets published on the Web. The approach relies on an open ecosystem,where dataset owners and providers publish semantically enhanced metadata on their own sites. We then aggregate, normalize, and reconcile this ...

  10. A dataset describing data discovery and reuse practices in research

    This paper presents a dataset produced from the largest known survey examining how researchers and support professionals discover, make sense of and reuse secondary research data. 1677 respondents ...

  11. arXiv Summarization Dataset Dataset

    arXiv Summarization Dataset. Introduced by Cohan et al. in A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. This is a dataset for evaluating summarisation methods for research papers. Source: A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. Homepage.

  12. Public Use Data Archive

    The NBER Public Use Data Archive is an eclectic mix of public-use economic, demographic, and enterprise data obtained over the years to satisfy the specific requests of NBER-affiliated researchers for particular projects. Files here are often in more convenient formats than the original data source. However, files that receive updates at the ...

  13. Find Open Datasets and Machine Learning Projects

    Download Open Datasets on 1000s of Projects + Share Projects on One Platform. Explore Popular Topics Like Government, Sports, Medicine, Fintech, Food, More. Flexible Data Ingestion.

  14. data

It is becoming common for authors to upload the raw data of their research when publishing their papers. However, still only a small fraction of papers include the dataset. ... browse the database until you find an interesting data set, and the paper will likely be referenced by the database entry (if it is a serious database). Here is an analogy ...

  15. unarXive: a large scholarly data set with publications' full-text

    In recent years, scholarly data sets have been used for various purposes, such as paper recommendation, citation recommendation, citation context analysis, and citation context-based document summarization. The evaluation of approaches to such tasks and their applicability in real-world scenarios heavily depend on the used data set. However, existing scholarly data sets are limited in several ...

  16. arXiv Dataset

    arXiv dataset and metadata of 1.7M+ scholarly papers across STEM. code. New Notebook. table_chart. New Dataset. tenancy. New Model. emoji_events. New Competition. corporate_fare. New Organization. No Active Events. Create notebooks and keep track of their status here. add New Notebook. auto_awesome_motion. 0 Active Events. expand_more. menu ...

  17. Sample Datasets

    The resources here constitute test, sample, or practice data, or representative samples of live datasets that may be used in teaching and learning statistical analysis techniques. I recommend using these when the actual topic or question of your research is secondary to learning the techniques--if that is not the case, ...

  18. LibGuides: Research Data Services: Data Papers & Journals

    Data preservation is a corollary of data papers, not their main purpose. Most data journals do not archive data in-house. Instead, they generally require that authors submit the dataset to a repository. These repositories archive the data, provide persistent access, and assign the dataset a unique identifier (DOI).

  19. Download Datasets

    Pew Research Center makes its data available to the public for secondary analysis after a period of time. See this post for more information on how to use our datasets and contact us at [email protected] with any questions. Find a dataset by research area: U.S. Politics & Policy. Journalism & Media. Internet & Tech. Science & Society.

  20. Learning to Do Qualitative Data Analysis: A Starting Point

    Transcribing a data set can feel overwhelming and it may be tempting (and at times necessary) to outsource this activity to a professional transcriptionist. ... On the basis of Rocco (2010), Storberg-Walker's (2012) amended list on qualitative data analysis in research papers included the following: (a) the article should provide enough ...

  21. (PDF) A survey of iris datasets

    In this paper, we provide a comprehensive overview of the existing publicly available datasets and their popularity in the research community using a bibliometric approach. We reviewed 158 ...

  22. The Beginner's Guide to Statistical Analysis

    Table of contents. Step 1: Write your hypotheses and plan your research design. Step 2: Collect data from a sample. Step 3: Summarize your data with descriptive statistics. Step 4: Test hypotheses or make estimates with inferential statistics.

  23. [2404.12720] PDF-MVQA: A Dataset for Multimodal Information Retrieval

    Document Question Answering (QA) presents a challenge in understanding visually-rich documents (VRD), particularly those dominated by lengthy textual content like research journal articles. Existing studies primarily focus on real-world documents with sparse text, while challenges persist in comprehending the hierarchical semantic relations among multiple pages to locate multimodal components ...

  24. [2404.13910] Integrated Gradient Correlation: a Dataset-wise

    Attribution methods are primarily designed to study the distribution of input component contributions to individual model predictions. However, some research applications require a summary of attribution patterns across the entire dataset to facilitate the interpretability of the scrutinized models. In this paper, we present a new method called Integrated Gradient Correlation (IGC) that ...

  25. Towards Gender Harmony Dataset: Gender Beliefs and Gender ...

    The data comprising the TGH project results are stored in a single table. The data table is available in the repository 29 in three formats: csv, xlsx, and Rda. The dataset contains 33,313 ...

  26. Research Papers / Publications

    Research Papers / Publications. Xinmeng Huang, Shuo Li, Mengxin Yu, Matteo Sesia, Seyed Hamed Hassani, Insup Lee, Osbert Bastani, Edgar Dobriban, Uncertainty in Language Models: Assessment through Rank-Calibration. Patrick Chao, Edoardo Debenedetti, Alexander Robey, Maksym Andriushchenko, Francesco Croce, Vikash Sehwag, Edgar Dobriban, Nicolas ...

  27. Design of highly functional genome editors by modeling the ...

    Gene editing has the potential to solve fundamental challenges in agriculture, biotechnology, and human health. CRISPR-based gene editors derived from microbes, while powerful, often show significant functional tradeoffs when ported into non-native environments, such as human cells. Artificial intelligence (AI) enabled design provides a powerful alternative with potential to bypass ...

  28. TV100: A TV Series Dataset that Pre-Trained CLIP Has Not Seen

    This paper seeks to address this crucial inquiry. In line with our objective, we have made publicly available a novel dataset comprised of images from TV series released post-2021. This dataset holds significant potential for use in various research areas, including the evaluation of incremental learning, novel class discovery, and long-tailed ...

  29. How Democrats, Republicans differ over K-12 education

    Pew Research Center conducted this analysis to provide a snapshot of partisan divides in K-12 education in the run-up to the 2024 election. The analysis is based on data from various Center surveys and analyses conducted from 2021 to 2023, as well as survey data from Education Next, a research journal about education policy.

  30. Latest science news, discoveries and analysis

    Find breaking science news and analysis from the world's leading research journal.