Data for researchers: Extracted text from more than 89,000 FDA medical device 510(k) clearance packages

510(k)

datasets

FDA

medical devices

regulatory science

Author

Brendan O’Leary

Published

2023-07-17

Modified

2025-11-19

Important

510(k) summaries are written by device manufacturers, not the FDA. When the FDA requests edits to a 510(k) summary, it may be as a final step, under time pressure, when the FDA is ready to clear the device but has not yet issued its final decision. The quality and quantity of information vary widely, and even the best 510(k) summaries are summaries: They are not comprehensive.

Therefore, it is easy to draw the wrong conclusions when analyzing 510(k) summaries. For example, 510(k) summaries alone cannot be used to provide reliable estimates of the proportion of 510(k)s that included clinical data, a subgroup analysis, or other types of evidence. For those types of analyses, decision summaries are a better option when they are available, and other sources should also be included, such as device labeling, ClinicalTrials.gov, and scientific publications. Even if you use all of the best available public data, your estimates may be far from the truth.

Introduction

Every year, the FDA clears around 3,000 medical devices to enter the U.S. market through a review pathway known as the 510(k) program. Descriptions of the data and information that formed the basis for many of these decisions are available in 510(k) summaries, which are posted as PDFs on the FDA’s website after a decision is made. This makes it possible to find information about individual 510(k) clearance decisions, but analyzing the information across the 510(k) program over time has been more difficult because it is only made publicly available through tens of thousands of individual files – until now. Here, you’ll find a dataset with the full text contents of more than 90,000 510(k) clearance packages.¹ This includes full embedded and OCR text from over 609,000 pages.² It can be downloaded in CSV format.

If you use this, please cite this page.

Note

The information in this dataset is from sources in the public domain. It is provided here “as-is” without warranty of any kind. You may also want to refer to the FDA’s website.

Dataset description

Note

When originally published, this dataset focused on 510(k) summaries. It now also includes clearance packages for many submissions where a 510(k) statement was used instead of a 510(k) summary. For these submissions, the clearance letter and indications for use are included in this dataset.

I changed the OCR approach in October 2025 to improve quality. OCR text for clearance packages with a date_obtained of October 2025 or later should be a bit better as a result. I’m slowly refreshing the dataset and am prioritizing clearance packages that may have been missed in earlier versions of this dataset and clearance packages where no embedded text was available in earlier versions of this dataset.
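If OCR quality matters for your analysis, one option is to flag pages by when they were obtained. The sketch below is a minimal base R illustration and not part of the pipeline used to build this dataset: the helper name `has_refreshed_ocr` is hypothetical, the cutoff of 2025-10-01 is an assumption (the note above says only "October 2025"), and the toy data frame with made-up submission numbers stands in for the real dataset.

```r
# Hypothetical helper: TRUE for pages obtained on or after an assumed cutoff.
# The note above says only "October 2025"; 2025-10-01 is an assumption.
has_refreshed_ocr <- function(date_obtained, cutoff = as.Date("2025-10-01")) {
  !is.na(date_obtained) & date_obtained >= cutoff
}

# Toy rows standing in for the real dataset (submission numbers are made up)
pages <- data.frame(
  submission_number = c("K000001", "K000002", "K000003"),
  date_obtained = as.Date(c("2023-07-01", "2025-10-15", NA)),
  stringsAsFactors = FALSE
)

# Rows with a missing date are treated as not refreshed
pages$refreshed_ocr <- has_refreshed_ocr(pages$date_obtained)
```

With the real data, the same helper could be applied to the date_obtained column after reading the CSV.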

Most 510(k) clearance packages include:

A clearance letter from the FDA
An “indications for use” form
A 510(k) summary

For submissions where a “510(k) Statement” is used, only the clearance letter and the indications for use form are present.³

This dataset provides one row per page for each 510(k) clearance package and includes the following fields:

submission_number: The 510(k) or De Novo number for the submission associated with the 510(k) clearance package.
date_obtained: The date the 510(k) clearance package was obtained from the FDA website. The date is formatted according to ISO 8601.
page_number: The PDF page index from which the text was obtained.
text_embedded: The contents of any text embedded in the PDF. This is obtained using pdftools::pdf_text.
text_ocr: The contents of any text found using optical character recognition (OCR). This is from Tesseract.
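Because the dataset has one row per page, many analyses start by reassembling each clearance package into a single document. The following is a minimal sketch in base R, assuming the column names listed above; the helper name `collapse_pages` and the toy rows are hypothetical. It sorts by page_number before joining, so it does not rely on the rows already being in page order.

```r
# Hypothetical helper: join each package's pages, in page order, into one string
collapse_pages <- function(pages, text_col = "text_ocr") {
  ordered <- pages[order(pages$submission_number, pages$page_number), ]
  text <- ordered[[text_col]]
  # Treat pages with no text as empty strings rather than the literal "NA"
  text[is.na(text)] <- ""
  vapply(
    split(text, ordered$submission_number),
    paste,
    character(1),
    collapse = "\n"
  )
}

# Toy rows standing in for the real dataset (submission numbers are made up)
toy_pages <- data.frame(
  submission_number = c("K000002", "K000001", "K000001"),
  page_number = c(1L, 2L, 1L),
  text_ocr = c("only page", "second page", "first page"),
  stringsAsFactors = FALSE
)

# Named character vector with one element per submission
documents <- collapse_pages(toy_pages)
```

Passing `text_col = "text_embedded"` would do the same for the embedded text.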

Download the data

If you just want to find a predicate device

For finding predicate devices, a site-specific search will probably serve you better than this dataset will. For example, if you are looking for tumor segmentation algorithms:

Google: site:accessdata.fda.gov/cdrh_docs/ “tumor segmentation”
DuckDuckGo: site:accessdata.fda.gov/cdrh_docs/ tumor segmentation
Bing: site:accessdata.fda.gov/cdrh_docs/ tumor segmentation

The dataset is available as a gzip-compressed CSV file: pmn_summary_text.csv.gz (Size: 279M, MD5: 37ecd600aeb09687c173ec2177f17c1b)

Example of how to access and use the dataset with the R programming language

Here is a sample script in R that downloads and reads the dataset:

# Load libraries -----
library(fs)
library(readr)
library(utils)

# Set up URLs and paths -----
url_download <- 
  "https://www.boleary.com/blog/posts/202307-pmn/data/pmn_summary_text.csv.gz"
path_download <- fs::dir_create("data-raw")
filename_download <- "pmn_summary_text.csv.gz"
filepath_download <- 
  fs::path_expand(paste(path_download, filename_download, sep = "/"))

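# Download the data -----
# The file is large, and R's default download timeout is 60 seconds, so raise
# the timeout for this session if it is lower than 10 minutes.
options(timeout = max(600, getOption("timeout")))
utils::download.file(
  url = url_download,
  destfile = filepath_download
)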

# Read in the data -----
# Naming this "pmn_summaries" because "premarket notification (PMN)" is 
# another name for 510(k)s that's a bit easier to use in code. 
pmn_summaries <- 
  readr::read_delim(
    file = filepath_download,
    delim = ";",
    col_types = 
      readr::cols(
        submission_number = readr::col_character(),
        date_obtained = readr::col_date(),
        page_number = readr::col_integer(),
        text_embedded = readr::col_character(),
        text_ocr = readr::col_character()
      )
  )
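Since an MD5 checksum is published alongside the download link, it may be worth checking a downloaded copy against it before relying on the data. This is a minimal sketch rather than part of the original script: the helper name `verify_md5` and the variable `filepath_check` are hypothetical, and the expected value is the checksum listed on this page, which will presumably change whenever the dataset is refreshed.

```r
# Hypothetical helper: compare a file's MD5 checksum with an expected value
verify_md5 <- function(path, expected) {
  actual <- unname(tools::md5sum(path))
  identical(tolower(actual), tolower(expected))
}

# Checksum listed on this page for pmn_summary_text.csv.gz; it will differ
# after the dataset is updated.
expected_md5 <- "37ecd600aeb09687c173ec2177f17c1b"
filepath_check <- file.path("data-raw", "pmn_summary_text.csv.gz")

# Only run the check if the file has already been downloaded
if (file.exists(filepath_check)) {
  if (!verify_md5(filepath_check, expected_md5)) {
    warning("MD5 mismatch: the download may be incomplete or the dataset updated.")
  }
}
```

A mismatch does not necessarily mean the file is corrupt; it can also mean the published file has changed since the checksum was copied.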

Here is a sample script in R that identifies the 510(k)s referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. It assumes the pmn_summaries object created by the previous script is already loaded.

# Load and install additional libraries -----
# Install the fdadata package from GitHub if it's missing
if (!require("fdadata")) {
  if (!require("devtools")) install.packages("devtools")
  devtools::install_github("bjoleary/fdadata")
}
library(dplyr)
library(lubridate)
library(stringr)
library(testthat)
library(tidyr)

# Load 510(k) submission metadata and filter to image processing devices -----
submissions_of_interest <- 
  fdadata::pmn |> 
  dplyr::filter(
    # Looking for submissions in product code LLZ for "System, Image 
    # Processing, Radiological"
    .data$product_code == "LLZ",
    # Looking for submissions with a decision date on or after 2008-01-01: 
    .data$date_decision >= lubridate::ymd("2008-01-01"),
    # Looking for submissions with a decision date before 2019-01-01: 
    .data$date_decision < lubridate::ymd("2019-01-01"),
  ) |> 
  # Just keep the submission_number field for this analysis
  dplyr::select("submission_number")

# Filter the pmn_summaries data by joining the submissions_of_interest -----
summaries_to_search <- 
  dplyr::inner_join(
    x = submissions_of_interest, 
    y = pmn_summaries, 
    by = c("submission_number" = "submission_number")
  )

# Set up a search term -----
submission_number_pattern <- 
  stringr::regex(
    # Match the letter "K" followed by exactly 6 numeric digits
    pattern = "K[0-9]{6}",
    # If, instead, you wanted to find both 510(k)s and De Novos, you might 
    # start with a pattern like this: "(K|DEN)[0-9]{6}"
    # Accept either upper- or lower-case "K"s
    ignore_case = TRUE
  )

# Double check that the regular expression search term is behaving as expected
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "Can we find the submission number for K000000?",
      pattern = submission_number_pattern
    ),
  expected = list(c("K000000"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "What if we include a supplement number? K123456/S001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "And if it's a lower case K? k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "This time we will want to see both K123456 and k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456", "k180001"))
)
testthat::expect_equal(
  object = 
    stringr::str_extract_all(
      string = "We don't expect it to match a q-submission number like Q123456",
      pattern = submission_number_pattern
    ),
  expected = list(character(0L))
)

# Search for submission numbers
search_results <- 
  summaries_to_search |> 
  # For each page, concatenate the embedded and OCR text so we can create one 
  # string where we have the best chance of finding submission numbers 
  # (we'll de-duplicate the numbers we find later)
  tidyr::unite(
    col = "combined_text",
    c("text_embedded", "text_ocr"),
    sep = " ",
    remove = TRUE,
    na.rm = TRUE
  ) |> 
  # Combine all clearance package pages from each submission into a single string
  dplyr::group_by(.data$submission_number) |> 
  dplyr::summarise(
    text = paste(.data$combined_text, collapse = "\n")
  ) |> 
  # Extract 510(k) submission numbers
  dplyr::mutate(
    submission_referenced =
      stringr::str_extract_all(
        string = .data$text,
        pattern = submission_number_pattern
      ) 
  ) |> 
  # Keep only submission number and results
  dplyr::select(
    "submission_number",
    "submission_referenced"
  ) |> 
  # Make 1 row for each reference found
  tidyr::unnest(cols = c(submission_referenced)) |> 
  # Make sure they are all upper case
  dplyr::mutate(
    submission_referenced = stringr::str_to_upper(.data$submission_referenced)
  ) |> 
  # Remove results where the reference found is the same as the submission 
  # it was found in
  dplyr::filter(.data$submission_number != .data$submission_referenced) |> 
  # Don't double count a reference just because it may have been mentioned 
  # more than once
  dplyr::distinct() |> 
  # Tally it up
  dplyr::group_by(.data$submission_referenced) |> 
  dplyr::tally(name = "references") |> 
  # Put in order of frequency of appearance followed by submission number, 
  # placing more recent submission numbers first
  dplyr::arrange(
    dplyr::desc(.data$references), 
    dplyr::desc(.data$submission_referenced)
  ) |> 
  # Limit to the first five rows
  utils::head(5) |> 
  # Join in some metadata
  dplyr::left_join(
    y = 
      fdadata::pmn |> 
      dplyr::select(
        "submission_number",
        "date_decision",
        "sponsor",
        "device"
      ),
    by = c("submission_referenced" = "submission_number")
  )

This produces Table 1.

Table 1: Five submissions frequently referenced in 510(k) summaries for image processing devices cleared from 2008 - 2018

	Submission Referenced	References	Date Decision	Sponsor	Device
1	K071331	14	2007-05-25	VITAL IMAGES, INC.	VITREA VERSION 4.0
2	K120361	12	2012-04-06	FUJIFILM MEDICAL SYSTEMS USA, INC.	SYNAPSE 3D BASE TOOLS
3	K150843	11	2015-04-24	Siemens AG	syngo.via (version VB10A)
4	K110300	11	2011-07-01	MATERIALISE DENTAL NV	SIMPLANT 2011
5	K073714	11	2008-03-19	ORTHOCRAT, LTD.	TRAUMACAD VERSION 2.0
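The search pattern in the script matches "K" plus six digits anywhere in the text, including inside a longer token such as a seven-digit number. A stricter variant is sketched below, following the "(K|DEN)[0-9]{6}" suggestion in the script's comments and adding word boundaries. It uses base R instead of stringr so that it runs without extra packages, the helper name `extract_submission_numbers` is hypothetical, and I have not tested whether word boundaries help or hurt on noisy OCR text.

```r
# Hypothetical helper: extract unique 510(k) and De Novo numbers from one string
extract_submission_numbers <- function(text) {
  # "K" or "DEN", then exactly six digits, bounded by non-word characters
  pattern <- "\\b(?:K|DEN)[0-9]{6}\\b"
  matches <- regmatches(
    text,
    gregexpr(pattern, text, ignore.case = TRUE, perl = TRUE)
  )[[1]]
  # Upper-case before de-duplicating so "k123456" and "K123456" count once
  unique(toupper(matches))
}

found <- extract_submission_numbers(
  "Predicates: K123456, k123456, and DEN170001."
)
```

Here `found` holds "K123456" and "DEN170001". A string such as "K1234567" yields no match, which differs from the pattern used for Table 1.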

Additional considerations

PDF Portfolios are not included. A small number of 510(k) summaries are posted as PDF portfolios and may not have been processed correctly or included in this dataset. Based on manual spot-checks, I believe that problems are particularly common when a PDF portfolio includes a fillable version of the indications for use form.
Many 510(k) summaries do not include embedded text. Embedded text is not present in many of the 510(k) clearance packages, particularly for decisions made many years ago. Both embedded text, when available, and text from OCR should be included for each page in this dataset. Which you choose to use and when may depend on your specific use-case.
Many 510(k)s have a 510(k) statement instead of a summary. Not all cleared 510(k)s have 510(k) clearance packages on the FDA website. Some manufacturers use a 510(k) statement in lieu of a 510(k) summary, which means they promise to provide safety and effectiveness information within 30 days of a request from any person.⁴ In addition, the 510(k) Summary/Statement requirement did not exist until the 1990s, so earlier submissions do not have 510(k) clearance packages.⁵
510(k) summaries are not written by the FDA. A 510(k) summary is written by the manufacturer of the device, not by the FDA. Sometimes, the FDA provides considerable input. Other times, the FDA may conduct only a cursory review of a 510(k) summary. Practice has varied over the decades. Sometimes, the manufacturer and the FDA may forget to update the contents of a 510(k) summary at the end of a review after additional information has been provided, and a 510(k) summary may only reflect what was initially provided to the FDA before all questions were resolved. Be cautious about drawing firm conclusions about what was included in, or absent from, a 510(k) on the basis of a 510(k) summary.
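On the choice between embedded and OCR text, one simple rule is to use the embedded text whenever it contains something other than whitespace and fall back to OCR otherwise. The sketch below shows that rule in base R. It is only one possible approach, the helper name `preferred_text` is hypothetical, and the sample script above takes a different route by concatenating both fields.

```r
# Hypothetical helper: use embedded text when it has content, otherwise OCR text
preferred_text <- function(text_embedded, text_ocr) {
  has_embedded <- !is.na(text_embedded) & nzchar(trimws(text_embedded))
  ifelse(has_embedded, text_embedded, text_ocr)
}

# Toy pages: embedded text present, whitespace only, and missing entirely
chosen <- preferred_text(
  text_embedded = c("510(k) Summary", "   ", NA),
  text_ocr = c("510(k) Surnmary", "Indications for Use", NA)
)
```

Preferring embedded text avoids OCR misreadings like the "Surnmary" in the toy data, but it assumes the embedded text layer is itself accurate, which may not hold for every PDF.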

Availability of clearance packages and summaries in this dataset

As noted above, the 510(k) summary regulation came into effect in the 1990s. While I believe this dataset includes all or nearly all of the 510(k) summaries the FDA has made available on its website, there are notable gaps. Figure 1 shows the percentage of expected 510(k) summaries that are available in this dataset by the fiscal year of decision.

Figure 1: 510(k) summary availability by fiscal year of decision. Data.

Clearance packages that include decision letters and indications for use statements may also be posted when a 510(k) statement is used, and correction letters may be posted for historical submissions when the FDA changes how it describes, identifies, groups, and tracks devices with product codes [1] or when the FDA makes other changes. Figure 2 shows the percentage of clearance packages that are available in this dataset based on the total number of clearances in each fiscal year.

Figure 2: Clearance package availability by fiscal year of decision. Data.
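An availability percentage of this kind can be approximated from submission metadata. The following is a rough sketch and not the code behind the figures: it assumes the U.S. federal fiscal year (October 1 through September 30), the helper name `fiscal_year` is hypothetical, and the toy data frame with made-up values stands in for the full list of clearances.

```r
# Hypothetical helper: U.S. federal fiscal year (October 1 through September 30)
fiscal_year <- function(date) {
  year <- as.integer(format(date, "%Y"))
  month <- as.integer(format(date, "%m"))
  # October through December belong to the next fiscal year
  year + (month >= 10L)
}

# Toy metadata standing in for the full list of clearances (values are made up)
clearances <- data.frame(
  submission_number = c("K000001", "K000002", "K000003", "K000004"),
  date_decision = as.Date(
    c("1999-10-01", "2000-09-30", "2000-10-01", "2001-03-15")
  ),
  stringsAsFactors = FALSE
)
# Toy set of submissions that have a clearance package in the dataset
in_dataset <- c("K000001", "K000002", "K000004")

clearances$fiscal_year <- fiscal_year(clearances$date_decision)
clearances$available <- clearances$submission_number %in% in_dataset

# Percentage of clearances with a package in the dataset, by fiscal year
availability <- aggregate(available ~ fiscal_year, data = clearances, FUN = mean)
availability$percent_available <- 100 * availability$available
```

In the toy data, fiscal year 2000 comes out at 100 percent and fiscal year 2001 at 50 percent.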

I don’t know why the availability of clearance packages in this dataset falls off so dramatically in 2000 and 2001, but I’m looking into it.

In addition, 17 submissions associated with clearance packages in this dataset no longer appear in the FDA’s database: K110981, K112204, K112294, K131823, K131831, K152517, K182156, K192179, K210496, K211351, K211727, K212972, K220183, K220185, K221416, K230447, and K232365.

Some of these submissions appear to be for Powdered Surgeon’s Gloves, Powdered Patient Examination Gloves, and Absorbable Powder for Lubricating a Surgeon’s Glove, which were banned under a regulation issued in 2016. [2] The FDA also issued several notifications on data integrity, and this may account for some of these submissions. The FDA does not appear to have published information on its criteria or process for removing submissions from its databases after a clearance decision. Other submissions may have been removed before I was able to access them and add them to this dataset.
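A list like the one above can be produced by comparing the submission numbers in this dataset with those in a current copy of the FDA's 510(k) metadata. Below is a minimal sketch of that comparison in base R. The helper name `missing_from_metadata` and the toy inputs are hypothetical. With real data, the inputs might be the submission_number columns of pmn_summaries and fdadata::pmn, keeping in mind that a packaged snapshot of the metadata may lag behind the FDA's live database.

```r
# Hypothetical helper: submission numbers in the dataset but not in FDA metadata
missing_from_metadata <- function(dataset_numbers, metadata_numbers) {
  sort(setdiff(unique(toupper(dataset_numbers)), toupper(metadata_numbers)))
}

# Toy inputs (made-up numbers) standing in for the dataset and FDA's database
dataset_numbers <- c("K000003", "K000001", "K000001", "K000002")
metadata_numbers <- c("K000001")

removed <- missing_from_metadata(dataset_numbers, metadata_numbers)
```

Here `removed` holds "K000002" and "K000003", the two toy submissions with pages in the dataset but no metadata record.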

Changelog

2025-11-19:
- Added 326 additional 510(k) clearance packages.
2025-10-22:
- Added 673 additional 510(k) clearance packages.
- Added new note about the limitations of this dataset at the top of the post.
- Added new section on the availability of clearance packages and summaries in this dataset.
- Updated post title and graphic.
- I changed the OCR approach in October 2025 to improve quality. OCR text for clearance packages with a date_obtained of October 2025 or later should be a bit better as a result. I’m slowly refreshing the dataset, prioritizing clearance packages that may have been missed in earlier versions of this dataset as well as those where no embedded text is available.
2025-09-10:
- Added additional 510(k) clearance packages.
2025-08-03:
- Added additional 510(k) clearance packages.
- Revised disclaimer.
2025-06-10:
- Added additional 510(k) clearance packages.
2025-05-22:
- Added additional 510(k) clearance packages.
2025-05-14:
- Added additional 510(k) clearance packages.
2025-04-15:
- Added additional 510(k) clearance packages.
2025-03-16:
- Added additional 510(k) clearance packages.
2025-02-15:
- Removed “Known Issues” because I have not been keeping it up-to-date.
- Updated ad.
- Added additional 510(k) clearance packages.
2025-01-22:
- Added additional 510(k) clearance packages.
2025-01-04:
- Added additional 510(k) clearance packages.
2024-12-18:
- Added additional 510(k) clearance packages.
2024-11-07:
- Added additional 510(k) clearance packages.
2024-10-17:
- Added additional 510(k) clearance packages.
2024-10-08:
- Removed parquet file format option.
- Added additional 510(k) clearance packages.
2024-09-03:
- Added additional 510(k) clearance packages.
2024-08-19:
- Added additional 510(k) clearance packages.
2024-08-12:
- Added additional 510(k) clearance packages.
2024-08-06:
- Added additional 510(k) clearance packages.
2024-07-22:
- Added additional 510(k) clearance packages.
2024-07-08:
- Added additional 510(k) clearance packages.
- Changed parquet compression from gzip to snappy.
2024-06-27:
- Added additional 510(k) clearance packages.
2024-06-05:
- Added clearance packages for 510(k)s with 510(k) Statements instead of summaries. These include the clearance letter and the indications for use and do not include a 510(k) summary.
2024-05-10:
- Added additional 510(k) summaries.
2024-02-29:
- Added additional 510(k) summaries.
2023-12-23:
- Added additional 510(k) summaries.
2023-10-25:
- Added additional 510(k) summaries.
2023-08-14:
- Fixed an error in the sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. After submission numbers are extracted, they are now all made upper case using stringr::str_to_upper() before they are counted. Before, for example, “K100001” and “k100001” would have been counted as different submissions because of the difference in case for the “K”. This fix did not change the results presented in Table 1.
- Fixed a spelling mistake in a footnote.
- Added Known issues section.
- Clarified that De Novo reclassification orders are treated as 510(k) summaries for the purposes of this dataset.
- Added new submissions recently posted to the FDA website.
- Added old submissions that were previously missing from the dataset. I believe the dataset is now comprehensive as of 2023-08-14.
2023-07-17: Initial publication.

References

[1] “Medical device classification product codes,” U.S. Food and Drug Administration, Final Guidance Document. Docket Number FDA-2011-D-0916, Apr. 2013. Available: https://www.regulations.gov/document/FDA-2011-D-0916-0009

[2] “81 FR 91722: Banned devices; powdered surgeon’s gloves, powdered patient examination gloves, and absorbable powder for lubricating a surgeon’s glove,” U.S. Food and Drug Administration, Final Rule. Federal Register Notice 81 FR 91722, Dec. 2016. Available: https://www.federalregister.gov/d/2016-30382

Footnotes

¹ Upon initial publication, this dataset included more than 72,000 510(k) summary packages. This includes De Novos, where the reclassification order was used as the summary. Earlier versions of this dataset did not include clearance packages for 510(k)s where a 510(k) Statement was used.
² Upon initial publication, this dataset included more than 494,000 pages.
³ See: https://www.fda.gov/medical-devices/premarket-notification-510k/content-510k#link_7.
⁴ See the FDA’s description of the necessary Content of a 510(k), which describes this in more depth.
⁵ The requirement for a 510(k) summary or a 510(k) statement is from the Safe Medical Devices Act (SMDA) of 1990. The regulation, 21 CFR 807.92, was established through an interim rule with 57 FR 18066 on April 28, 1992 and was finalized with 59 FR 64295 on December 14, 1994.

Reuse

CC BY 4.0

Citation

BibTeX citation:

@online{o'leary,
  author = {O’Leary, Brendan},
  title = {Data for Researchers: {Extracted} Text from More Than 89,000
    {FDA} Medical Device 510(k) Clearance Packages},
  date = {},
  url = {https://www.boleary.com/blog/posts/202307-pmn/},
  langid = {en}
}

For attribution, please cite this work as:

B. O’Leary, “Data for researchers: Extracted text from more than 89,000 FDA medical device 510(k) clearance packages.” https://www.boleary.com/blog/posts/202307-pmn/