# Load libraries -----
library(fs)
library(readr)
library(utils)
# Set up URLs and paths -----
url_download <-
  "https://www.boleary.com/blog/posts/202307-pmn/data/pmn_summary_text.csv.gz"
path_download <- fs::dir_create("data-raw")
filename_download <- "pmn_summary_text.csv.gz"
filepath_download <-
  fs::path_expand(paste(path_download, filename_download, sep = "/"))
# Download the data -----
utils::download.file(
  url = url_download,
  destfile = filepath_download
)
# Read in the data -----
# Naming this "pmn_summaries" because "premarket notification (PMN)" is
# another name for 510(k)s that's a bit easier to use in code.
pmn_summaries <-
  readr::read_delim(
    file = filepath_download,
    delim = ";",
    col_types =
      readr::cols(
        submission_number = readr::col_character(),
        date_obtained = readr::col_date(),
        page_number = readr::col_integer(),
        text_embedded = readr::col_character(),
        text_ocr = readr::col_character()
      )
  )
Introduction
Every year, the FDA clears around 3,000 medical devices to enter the U.S. market through a review pathway known as the 510(k) program. Descriptions of the data and information that formed the basis for many of these decisions are available in 510(k) summaries, which are posted as PDFs on the FDA’s website after a decision is made. This makes it possible to find information about individual 510(k) clearance decisions, but analyzing the information across the 510(k) program over time has been more difficult because it is only made publicly available through tens of thousands of individual files – until now. Here, you’ll find a dataset with the full text contents of more than 86,000 510(k) clearance packages.1 This includes full embedded and OCR text from over 573,000 pages.2 It can be downloaded in CSV format.
If you use this, please cite this page.
The information in this dataset is from sources in the public domain. It is provided here “as-is” without warranty of any kind. For the most accurate and up-to-date information, always refer to the FDA website.
Dataset description
When originally published, this dataset focused on 510(k) summaries. It now also includes clearance packages for many submissions where a 510(k) statement was used instead of a 510(k) summary. For these submissions, the clearance letter and indications for use are included in this dataset.
Most 510(k) clearance packages include:
- A clearance letter from the FDA
- An “indications for use” form
- A 510(k) summary
For submissions where a “510(k) Statement” is used, only the clearance letter and the indications for use form are present.3
This dataset provides one row per page for each 510(k) clearance package and includes the following fields:
submission_number
- The 510(k) or De Novo number for the submission associated with the 510(k) clearance package.
date_obtained
- The date the 510(k) clearance package was obtained from the FDA website. The date is formatted according to ISO 8601.
page_number
- The PDF page index from which the text was obtained.
text_embedded
- The contents of any text embedded in the PDF. This is obtained using pdftools::pdf_text.
text_ocr
- The contents of any text found using optical character recognition (OCR). This is from Tesseract via pdftools::pdf_ocr_text.
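To make the one-row-per-page layout concrete, here is a minimal sketch in base R. It uses a small made-up data frame with the same five columns (every value is invented for illustration and none comes from the dataset) and counts the pages in each clearance package.

```r
# Toy rows mimicking the dataset's layout; all values are invented.
toy_pages <- data.frame(
  submission_number = c("K000001", "K000001", "K000002"),
  date_obtained = as.Date(c("2023-07-01", "2023-07-01", "2023-07-02")),
  page_number = c(1L, 2L, 1L),
  text_embedded = c("Letter text", NA, "Letter text"),
  text_ocr = c("Letter text", "Summary text", "Letter text"),
  stringsAsFactors = FALSE
)

# Because there is one row per page, the number of rows for a submission
# is the number of pages in its clearance package.
pages_per_submission <- table(toy_pages$submission_number)
print(pages_per_submission)
```

The same idea should carry over to the real data once it is loaded, by swapping the toy data frame for the full table.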
Download the data
For finding predicate devices, a site-specific search will probably serve you better than this dataset will. For example, if you are looking for tumor segmentation algorithms:
The dataset is available as a gzip-compressed CSV file: pmn_summary_text.csv.gz (Size: 281M, MD5: 362d5a3bd016a2a231adbc52f86125f9)
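If you want to confirm that a download completed intact, one option is to compare the file's MD5 hash with the value listed above. The sketch below assumes the file was saved to data-raw/pmn_summary_text.csv.gz; the helper name md5_matches is hypothetical and not part of the original scripts. The published hash will presumably change whenever the dataset is updated, so a mismatch may simply mean the listed value is stale.

```r
# Hypothetical helper (not from the original post): compare a file's MD5
# hash with an expected value, ignoring letter case.
md5_matches <- function(path, expected_md5) {
  actual_md5 <- unname(tools::md5sum(path))
  identical(tolower(actual_md5), tolower(expected_md5))
}

# Assumed download location; adjust to wherever you saved the file.
filepath_download <- file.path("data-raw", "pmn_summary_text.csv.gz")

if (file.exists(filepath_download)) {
  # MD5 value copied from the download line above.
  print(md5_matches(filepath_download, "362d5a3bd016a2a231adbc52f86125f9"))
}
```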
Example of how to access and use the dataset with the R programming language
The sample script in R at the top of this page downloads and reads the dataset.
The following sample script in R identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. It assumes that pmn_summaries has already been read in.
# Load and install additional libraries -----
# Install the fdadata package from GitHub if it's missing
if (!require("fdadata")) {
  if (!require("devtools")) install.packages("devtools")
  devtools::install_github("bjoleary/fdadata")
}
library(dplyr)
library(lubridate)
library(stringr)
library(testthat)
library(tidyr)
# Load 510(k) submission metadata and filter to image processing devices -----
submissions_of_interest <-
  fdadata::pmn |>
  dplyr::filter(
    # Looking for submissions in product code LLZ for "System, Image
    # Processing, Radiological"
    .data$product_code == "LLZ",
    # Looking for submissions with a decision date on or after 2008-01-01:
    .data$date_decision >= lubridate::ymd("2008-01-01"),
    # Looking for submissions with a decision date before 2019-01-01:
    .data$date_decision < lubridate::ymd("2019-01-01")
  ) |>
  # Just keep the submission_number field for this analysis
  dplyr::select("submission_number")

# Filter the pmn_summaries data by joining the submissions_of_interest -----
summaries_to_search <-
  dplyr::inner_join(
    x = submissions_of_interest,
    y = pmn_summaries,
    by = c("submission_number" = "submission_number")
  )
# Set up a search term -----
submission_number_pattern <-
  stringr::regex(
    # Match the letter "K" followed by exactly 6 numeric digits
    pattern = "K[0-9]{6}",
    # If, instead, you wanted to find both 510(k)s and De Novos, you might
    # start with a pattern like this: "(K|DEN)[0-9]{6}"
    # Accept either upper- or lower-case "K"s
    ignore_case = TRUE
  )
# Double check that the regular expression search term is behaving as expected
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "Can we find the submission number for K000000?",
      pattern = submission_number_pattern
    ),
  expected = list(c("K000000"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "What if we include a supplement number? K123456/S001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "And if it's a lower case K? k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("k180001"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "This time we will want to see both K123456 and k180001",
      pattern = submission_number_pattern
    ),
  expected = list(c("K123456", "k180001"))
)
testthat::expect_equal(
  object =
    stringr::str_extract_all(
      string = "We don't expect it to match a q-submission number like Q123456",
      pattern = submission_number_pattern
    ),
  expected = list(character(0L))
)
# Search for submission numbers
search_results <-
  summaries_to_search |>
  # For each page, concatenate the embedded and OCR text so we can create one
  # string where we have the best chance of finding submission numbers
  # (we'll de-duplicate the numbers we find later)
  tidyr::unite(
    col = "combined_text",
    c("text_embedded", "text_ocr"),
    sep = " ",
    remove = TRUE,
    na.rm = TRUE
  ) |>
  # Combine all clearance package pages from each submission into a single string
  dplyr::group_by(.data$submission_number) |>
  dplyr::summarise(
    text = paste(.data$combined_text, collapse = "\\n")
  ) |>
  # Extract 510(k) submission numbers
  dplyr::mutate(
    submission_referenced =
      stringr::str_extract_all(
        string = .data$text,
        pattern = submission_number_pattern
      )
  ) |>
  # Keep only submission number and results
  dplyr::select(
    "submission_number",
    "submission_referenced"
  ) |>
  # Make 1 row for each reference found
  tidyr::unnest(cols = c(submission_referenced)) |>
  # Make sure they are all upper case
  dplyr::mutate(
    submission_referenced = stringr::str_to_upper(.data$submission_referenced)
  ) |>
  # Remove results where the reference found is the same as the submission
  # it was found in
  dplyr::filter(.data$submission_number != .data$submission_referenced) |>
  # Don't double count a reference just because it may have been mentioned
  # more than once
  dplyr::distinct() |>
  # Tally it up
  dplyr::group_by(.data$submission_referenced) |>
  dplyr::tally(name = "references") |>
  # Put in order of frequency of appearance followed by submission number,
  # placing more recent submission numbers first
  dplyr::arrange(
    dplyr::desc(.data$references),
    dplyr::desc(.data$submission_referenced)
  ) |>
  # Limit to the first five rows
  utils::head(5) |>
  # Join in some metadata
  dplyr::left_join(
    y =
      fdadata::pmn |>
      dplyr::select(
        "submission_number",
        "date_decision",
        "sponsor",
        "device"
      ),
    by = c("submission_referenced" = "submission_number")
  )
This produces Table 1.
Row | Submission Referenced | References | Date Decision | Sponsor | Device |
---|---|---|---|---|---|
1 | K071331 | 16 | 2007-05-25 | VITAL IMAGES, INC. | VITREA VERSION 4.0 |
2 | K120361 | 12 | 2012-04-06 | FUJIFILM MEDICAL SYSTEMS USA, INC. | SYNAPSE 3D BASE TOOLS |
3 | K073714 | 12 | 2008-03-19 | ORTHOCRAT, LTD. | TRAUMACAD VERSION 2.0 |
4 | K150843 | 11 | 2015-04-24 | Siemens AG | syngo.via (version VB10A) |
5 | K110300 | 11 | 2011-07-01 | MATERIALISE DENTAL NV | SIMPLANT 2011 |
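A comment in the sample script suggests "(K|DEN)[0-9]{6}" as a starting point for finding De Novo numbers as well as 510(k) numbers. The sketch below tries that pattern with base R only, so it runs without stringr. The helper name extract_submission_numbers and the example sentence are made up for illustration.

```r
# Hypothetical helper: extract 510(k) ("K") and De Novo ("DEN") numbers from
# a single string, ignoring case, and return unique upper-case matches.
extract_submission_numbers <- function(text) {
  matches <- regmatches(
    text,
    gregexpr("(K|DEN)[0-9]{6}", text, ignore.case = TRUE)
  )
  unique(toupper(unlist(matches)))
}

# Invented example text: one 510(k) number (twice, in different cases),
# one De Novo number, and one Q-submission number that should be ignored.
found <- extract_submission_numbers(
  "Predicates: K123456 and den170001, discussed under Q123456; see k123456."
)
print(found)
```

Converting to upper case before de-duplicating mirrors the sample script, which upper-cases matches so that "K123456" and "k123456" are counted as the same submission.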
Additional considerations
PDF Portfolios are not included. A small number of 510(k) summaries are posted as PDF portfolios and may not have been processed correctly or included in this dataset. Based on manual spot-checks, I believe that problems are particularly common when a PDF portfolio includes a fillable version of the indications for use form.
Many 510(k) summaries do not include embedded text. Embedded text is not present in many of the 510(k) clearance packages, particularly for decisions made many years ago. Both embedded text, when available, and text from OCR should be included for each page in this dataset. Which you choose to use and when may depend on your specific use-case.
Many 510(k)s have a 510(k) statement instead of a summary. Not all cleared 510(k)s have 510(k) clearance packages on the FDA website. Some manufacturers use a 510(k) statement in lieu of a 510(k) summary, which means they promise to provide safety and effectiveness information within 30 days of a request from any person.4 In addition, the 510(k) Summary/Statement requirement did not exist until the 1990s, so earlier submissions do not have 510(k) clearance packages.5
510(k) summaries are not written by the FDA. A 510(k) summary is written by the manufacturer of the device, not by the FDA. Sometimes, the FDA provides considerable input. Other times, the FDA may conduct only a cursory review of a 510(k) summary. Practice has varied over the decades. Sometimes, the manufacturer and the FDA may forget to update the contents of a 510(k) summary at the end of a review after additional information has been provided, and a 510(k) summary may only reflect what was initially provided to the FDA before all questions were resolved. Be cautious about drawing firm conclusions about what was included in, or absent from, a 510(k) on the basis of a 510(k) summary.
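As noted above, many pages have no embedded text, so you may need a rule for which text field to use. One possible rule, sketched here in base R, is to prefer embedded text and fall back to OCR text when the embedded text is missing or blank. The helper name prefer_embedded is hypothetical, and whether this rule suits your analysis depends on your use case.

```r
# Hypothetical helper: use the embedded text when it has content, otherwise
# fall back to the OCR text for that page.
prefer_embedded <- function(text_embedded, text_ocr) {
  has_embedded <- !is.na(text_embedded) & nzchar(trimws(text_embedded))
  ifelse(has_embedded, text_embedded, text_ocr)
}

# Invented example pages: one with embedded text, one where it is missing,
# and one where it contains only whitespace.
page_text <- prefer_embedded(
  text_embedded = c("Embedded text", NA, "   "),
  text_ocr = c("OCR text 1", "OCR text 2", "OCR text 3")
)
print(page_text)
```

The sample script takes a different approach and concatenates both fields, which gives the best chance of finding a submission number at the cost of duplicated text.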
If you would like help mining this dataset or determining the best regulatory strategy for your product, I’m available as a consultant through NDA Partners or directly (message me on LinkedIn).
Known issues
Row | Submission Number | Issue | Date Checked | Status |
---|---|---|---|---|
1 | K050151 | Empty summary | 2023-08-14 | Not resolved |
2 | K222386 | Wrong submission | 2023-08-14 | Not resolved |
3 | K221515 | Wrong submission | 2023-08-14 | Not resolved |
4 | K211740 | Wrong submission | 2023-08-14 | Not resolved |
5 | K202565 | Wrong submission | 2023-08-14 | Not resolved |
6 | K190916 | Wrong submission | 2023-08-14 | Not resolved |
7 | K190027 | Wrong submission | 2023-08-14 | Not resolved |
8 | K170825 | Wrong submission | 2023-08-14 | Not resolved |
9 | K162044 | Wrong submission | 2023-08-14 | Not resolved |
10 | K900070 | Not a 510(k) summary (Complete submission) | 2023-08-14 | Not resolved |
11 | K030515 | Corrupt PDF | 2023-08-14 | Not resolved |
12 | K160695 | Corrupt PDF | 2023-08-14 | Not resolved |
13 | K173946 | Corrupt PDF | 2023-08-14 | Not resolved |
14 | K181029 | Corrupt PDF | 2023-08-14 | Not resolved |
15 | K192198 | Corrupt PDF | 2023-08-14 | Not resolved |
16 | K202408 | Corrupt PDF | 2023-08-14 | Not resolved |
17 | K210112 | Corrupt PDF | 2023-08-14 | Not resolved |
18 | K221619 | Corrupt PDF | 2023-08-14 | Not resolved |
19 | K210801 | Not a 510(k) summary (Decision summary) | 2023-08-14 | Not resolved |
20 | K993307 | Missing pages | 2023-08-14 | Not resolved |
21 | K220672 | Empty summary | 2023-08-14 | Not resolved |
Thanks to Jake W. for identifying many of these.
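If these records would distort your analysis, you could exclude them before searching. The sketch below is one way to do that in base R: the submission numbers are copied from the table above (as checked on 2023-08-14), while the helper name drop_known_issues and the toy data frame are made up for illustration.

```r
# Submission numbers copied from the known issues table above.
known_issue_numbers <- c(
  "K050151", "K222386", "K221515", "K211740", "K202565", "K190916",
  "K190027", "K170825", "K162044", "K900070", "K030515", "K160695",
  "K173946", "K181029", "K192198", "K202408", "K210112", "K221619",
  "K210801", "K993307", "K220672"
)

# Hypothetical helper: drop every page that belongs to a listed submission.
drop_known_issues <- function(pages, exclude = known_issue_numbers) {
  pages[!(pages$submission_number %in% exclude), , drop = FALSE]
}

# Invented example: one page from a listed submission, two from another.
toy_pages <- data.frame(
  submission_number = c("K050151", "K000001", "K000001"),
  page_number = c(1L, 1L, 2L)
)
cleaned_pages <- drop_known_issues(toy_pages)
print(cleaned_pages)
```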
Changelog
- 2024-12-18:
- Added additional 510(k) clearance packages.
- 2024-11-07:
- Added additional 510(k) clearance packages.
- 2024-10-17:
- Added additional 510(k) clearance packages.
- 2024-10-08:
- Removed parquet file format option.
- Added additional 510(k) clearance packages.
- 2024-09-03:
- Added additional 510(k) clearance packages.
- 2024-08-19:
- Added additional 510(k) clearance packages.
- 2024-08-12:
- Added additional 510(k) clearance packages.
- 2024-08-06:
- Added additional 510(k) clearance packages.
- 2024-07-22:
- Added additional 510(k) clearance packages.
- 2024-07-08:
- Added additional 510(k) clearance packages.
- Changed parquet compression from gzip to snappy.
- 2024-06-27:
- Added additional 510(k) clearance packages.
- 2024-06-05:
- Added clearance packages for 510(k)s with 510(k) Statements instead of summaries. These include the clearance letter and the indications for use and do not include a 510(k) summary.
- 2024-05-10:
- Added additional 510(k) summaries.
- 2024-02-29:
- Added additional 510(k) summaries.
- 2023-12-23:
- Added additional 510(k) summaries.
- 2023-10-25:
- Added additional 510(k) summaries.
- 2023-08-14:
- Fixed an error in the sample script in R that identifies the 510(k)s that were referenced in the most 510(k) summaries for radiological image processing devices cleared between calendar years 2008 and 2018. After submission numbers are extracted, they are now all made upper case using stringr::str_to_upper() before they are counted. Before, for example, “K100001” and “k100001” would have been counted as different submissions because of the difference in case for the “K”. This fix did not change the results presented in Table 1.
- Fixed a spelling mistake in a footnote.
- Added Known issues section.
- Clarified that De Novo reclassification orders are treated as 510(k) summaries for the purposes of this dataset.
- Added new submissions recently posted to the FDA website.
- Added old submissions that were previously missing from the dataset. I believe the dataset is now comprehensive as of 2023-08-14.
- 2023-07-17: Initial publication.
Footnotes
1. Upon initial publication, this dataset included more than 72,000 510(k) summary packages. This includes De Novos, where the reclassification order was used as the summary. Earlier versions of this dataset did not include clearance packages for 510(k)s where a 510(k) Statement was used.
2. Upon initial publication, this dataset included more than 494,000 pages.
3. See: https://www.fda.gov/medical-devices/premarket-notification-510k/content-510k#link_7.
4. See the FDA’s description of the necessary Content of a 510(k), which describes this in more depth.
5. The requirement for a 510(k) summary or a 510(k) statement is from the Safe Medical Devices Act (SMDA) of 1990. The regulation, 21 CFR 807.92, was established through an interim rule with 57 FR 18066 on April 28, 1992 and was finalized with 59 FR 64295 on December 14, 1994.
Citation
@online{o'leary2023,
author = {O’Leary, Brendan},
title = {Data for Researchers: {Extracted} Text from More Than 72,000
{FDA} Medical Device 510(k) Summaries},
date = {2023-07-17},
url = {https://www.boleary.com/blog/posts/202307-pmn/},
langid = {en}
}