Introduction to Amazon Athena
You will notice that I talk a lot about marketing and healthcare data on the website. I talk about it because I do the same thing to market my own services and for my healthcare customers every single day as well.
The best and worst part about marketing in healthcare is that healthcare data is public. It is all available for you to download from CMS. BUT, it is notoriously unreliable, overwhelming in nature, not organized in a manner most healthcare folks would understand.
And most of this data is available in CSV (comma separated values) or spreadsheet formats. YAY! However, most of these are in multi gigabytes. YIKES!
Traditionally you would need very experienced IT engineers to wrangle that data sitting together with experienced healthcare marketing teams to get you what you needed. But no more.
Well.. that’s precisely what Amazon Athena does for you. Amazon / AWS S3 has been around for ages and the cheapest, most reliable, HIPAA secure data storage option you will find in the market.
Amazon Athena lets you start querying data instantly. Get results in seconds. Pay only for the queries you run. Dirt cheap. You pay for ONLY what you use.
We LOVE LOVE LOVE it.
Serverless, no ETL (extract, Transform, Load)
Before Amazon Athena, your IT team would have to download those CSV files, extract, transform and load those multi gigabyte files into SQL servers, tune those darn servers, write queries for you to run etc etc..
Amazon Athena changes all that.
No longer do you have to have a dedicated IT team, dedicated hardware, install software or anything of that sort. Athena is serverless. You can quickly query your data without having to setup and manage any servers or data warehouses. Just point to your data in Amazon S3, define the schema, and start querying using the built-in query editor. Amazon Athena allows you to tap into all your data in S3 without the need to set up complex processes to extract, transform, and load the data (ETL).
Did I say dirt cheap?
With Amazon Athena, you pay only for the queries that you run. That means, all that multi, multi gigabytes of data files that you download from CMS and keep around because you are going to find some data from there before you go for your next hospital meeting?
You only get charged mere cents for storing it in AWS S3. You didn’t query it. You don’t pay for it.
So, what are you charged?
You are charged $5 per terabyte scanned by your queries. You can save from 30% to 90% on your per-query costs and get better performance by compressing, partitioning, and converting your data into columnar formats. Athena queries data directly in Amazon S3. There are no additional storage charges beyond S3..
Is Amazon Athena Fast?
Out of the box, Amazon Athena is fast. Query editor v2 is faster. Plus, if you convert your data into columnar format (a simple CTAS command), your queries are even faster (we mean BLAZING fast). You get fast interactive performance even for large datasets!
With Amazon Athena, you don’t have to worry about having enough compute resources to get fast, interactive query performance. Amazon Athena automatically executes queries in parallel, so most results come back within seconds.
Amazon Athena uses Presto with ANSI SQL support and works with a variety of standard data formats, including CSV, JSON, ORC, Avro, and Parquet. Athena is ideal for quick, ad-hoc querying but it can also handle complex analysis, including large joins, window functions, and arrays. Amazon Athena is highly available; and executes queries using compute resources across multiple facilities and multiple devices in each facility. Amazon Athena uses Amazon S3 as its underlying data store, making your data highly available and durable.
What do I use Amazon Athena for?
Every day, you have to make business decisions. You can only make correct decisions when you have the right information, on demand.
Where do we start? We market to healthcare businesses each day. Getting a hold of CMS database for healthcare businesses across the USA is easy. Go to CMS/NPPES website and download it.
Query it to your heart’s content with Amazon Athena (examples below).
PECOS and the NPI registry. In general, PECOS is Medicare’s enrollment and revalidation system. It is the sole verified source of Medicare data. We use the Physician Compare database as well. The underlying data source for the physician compare database is PECOS.
Download that and query it. Go for your next meeting, fully prepared.
Healthcare referrals pattern published for those 5 years? Download the data and fire your queries.
The limits are endless.
Of course, you can also use EzHCRM for all these purposes (prebuilt filters).
How about sensitive information? Are they stored in the cloud?
If you are a healthcare customer, you’d be interested to know that Amazon Athena can now be used to process data containing PHI. Encryption of data while in transit between Amazon Athena and S3 is provided by default using SSL/TLS. Encryption of PHI while at-rest on S3 should be performed according to the guidance provided in the S3 section (Amazon S3 itself is HIPAA eligible anyways). Encryption of query results from and within Amazon Athena, including staged results, should be enabled using server-side encryption with Amazon S3 managed keys (SSE-S3), AWS KMS-managed keys (SSE-KMS) or client-side encryption with AWS KMS-managed keys (CSE-KMS). Amazon Athena uses AWS CloudTrail to log all API calls.
Read more on AWS website.
How to find referring physicians for your business with Amazon Athena
Physician liaisons are tasked with introducing their employer practice to neighboring physicians. They are measured on how many new patient referrals the practice is getting each week/month/quarter. But, most of them are not equipped with the right tools to be able to do their job.
Of course, if you use our physician healthcare CRM, this task would be easy to do. However, you don’t necessarily need to, if you have your own IT team. Following are the steps you can take.
You are going to consolidate quite a few sources of information (publicly available from CMS). The first one is the NPI database, available from CMS Data dissemination page. Note that this data (zipped) contains all the practitioner information across the USA.
As per CMS website, The zipped NPPES Downloadable file will include the following 3 reference files:
- Other Name Reference File – this file contains additional Other Names associated with Type 2 NPIs
- Practice Location Reference File – this file contains all of the non-primary Practice Locations associated with Type 1 and Type 2 NPIs
- Endpoint Reference File – this file contains all Endpoints associated with Type 1 and Type 2 NPIs.
This is in addition to the “Full Replacement Monthly NPI File”. In addition to this, you can also get the “Weekly Incremental NPI Files”. These files are delivered in CSV format and you can download them, then extract them to your system (WARNING – these are pretty large files).
We recommend taking one of these two approaches of loading up and exploring this data.
We have already explained how to process this data using Amazon RDS in this blog about “how to use CMS data for medical marketing”. The following information is for Amazon Athena.
Step 1 – Download and extract NPI data
You will get the necessary files from NPI files link here. Extract each file. They should all be CSV files once you extract them (that’s how CMS distributes them).
Step 2 – Add these files to AWS S3
AWS S3 allows you to store enormous amounts of data at dirt cheap prices. We use it quite heavily (encrypted, fully secured, of course). This data set that you are downloading, however, is public data and you don’t have to worry about encryption and HIPAA related headaches as much.
Create folders in S3 to hold your NPI data. Make sure that you have one directory per large file (it truly does make it easier to manage). E.g. see below
Step 3 – Add a crawler using AWS Glue
Read up about AWS Glue on their website. “AWS Glue is a serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration so that you can start analyzing your data and putting it to use in minutes instead of months.”
We use AWS Glue for “Automatic schema discovery”. This allows us to not waste time in loading, extracting, transforming the large healthcare data files that we typically have to work with.
Initially, you will run it “On demand” because you want to test how things are working. Go ahead and do so, then, you can run this crawler on a schedule (matching the schedule where you download the NPI data files – weekly and/or monthly).
Step 4 – Add a table using AWS Athena
Read up about AWS Athena on their website. “Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
Athena is easy to use. Simply point to your data in Amazon S3, define the schema, and start querying using standard SQL.”
That’s pretty much exactly what you are going to do anyway. You have uploaded the file to Amazon S3 and now you are going to query it using Amazon Athena.
Notice that once AWS Glue crawler ran, it created a table for you already. You save a lot of money when you use Athena’s Parquet format (columnar vs flat). Go ahead and use the Athena’s CREATE TABLE AS (CTAS) query to create a parquet table for you. Here’s the syntax (from documentation)
CREATE TABLE new_table WITH ( format = 'Parquet', parquet_compression = 'SNAPPY') AS SELECT * FROM old_table;
Step 5 – Query your NPI table using AWS Athena
You can run a quick test (maybe for your own doctor) using Athena query console. E.g. let’s say you want to market to pediatrics providers
select np.NPI , Healthcare_Provider_Taxonomy_Code_1 as code , t.Classification , COALESCE(NULLIF(np.Provider_Other_Organization_Name, ''), np.Provider_Organization_Name_Legal_Business_Name) AS Organization_Name , Provider_Business_Practice_Location_Address_State_Name as state , Provider_Business_Practice_Location_Address_Telephone_Number as phone , Provider_Business_Practice_Location_Address_Fax_Number as fax , Authorized_Official_Last_Name AS ao_lname , Authorized_Official_First_Name AS ao_fname , Authorized_Official_Title_or_Position as AO_Title , Authorized_Official_Telephone_Number as ao_phone from npi_data_ctas_parquet np left join taxonomies t on np.Healthcare_Provider_Taxonomy_Code_1 = t.Code where np.Provider_Business_Practice_Location_Address_State_Name = 'NY' AND (Healthcare_Provider_Taxonomy_Code_1 = '208000000X' OR Healthcare_Provider_Taxonomy_Code_1 = '2080A0000X') limit 100;
You will get results back like thus
Which, you can save as a spreadsheet if you’d like.
How to find all doctors in the USA
Keep in mind that CMS data is VERY dirty.
However, there are several ways of enhancing the CMS data that you have uploaded using S3 and Athena.
Now that you have the table ready to go, you can make your life a little easier with Amazon Athena Views.
So, you can filter out all the medical doctors (i.e. individuals) from the practices / organizations with something as simple as this.
So, that gives you a view with only the doctors in the USA. You can query it as per your liking.
Again, feel free to download the information as a CSV for your healthcare CRM or wherever else you are using it.
If you use our physician healthcare CRM, this data is available for you to use.
How to find all healthcare businesses in the USA
So, you can filter out all the practices / organizations from the medical doctors (i.e. individuals) from the same dataset with something as simple as this.
Create a view and then query it however you want to.
You can filter out however you want to.
In all probability, you are not getting referrals from the dialysis centers 🙂 Just kidding.
Let’s narrow it down to all all referring providers that are relevant to you.
How to find referring practices near your practice locations
Let’s see what you get in the Bronx (as an example) by running a query like this below.
How about we narrow this list down to zip codes of your practice locations? What are the zip codes of your practice locations? Pick one or more of your zip codes.
Then, let’s run this query (example zip codes used here = 10451)
Great, you get some pretty good results (see below)
If you use our physician healthcare CRM, this data is available for you to use.
OK, so that’s interesting. Now, how about we get more specific about which kinds of providers we want to market to? Let’s run a query to find practices and their specialties, then limit them to containing “Medicine”. Keep in mind that you could also have queried using the taxonomy codes directly as well (much more precise).
How to find practices that are associated with hospitals or owned by other practices
You and your physician liaisons know very well that in some cases, all decisions are made by the “corporate locations” and sometimes decisions are decentralized and made by the “sites”. It is a good idea to have an idea about the organization structure before you market to a practice.
For this, let’s look at the following query – specifically, parent_organization_lbn
At least now you can decide whether you want to market top down or bottom up.
Let’s look at the independent practices around 10451 zip code
Interesting – didn’t expect so many of these practices to be independent.
Let’s find all practices that are owned by the big guns
Here’s what the output looks like.
Another way of looking at this same data would be to run this query
How you would have done this before Amazon Athena (a comparison)
Download the NPI database (NPPES Downloadable File)
Go to CMS data dissemination website here. Find this link to download the data (NPI_Files).
Keep in mind that the data is large. Moreover, when you expand the zip file, it will become even larger, so be prepared for that.
Once you download and extract the zip file, you will have these files (keep in mind that this can change at any point)
One thing to keep in mind before proceeding is that NPI database is not really that great. A lot of the information is old and has not been updated in years (even though CMS asks providers to keep their information up to date). In addition to that, you will notice that providers are not the best at clean, error free data entry either.
Deal with it – this is one of the best shots you have got (there’s another way, of course, wherein you can get even better information.. from payers’ member directories, but a tad more technically advanced).
Import this NPI data into a database
This is a bit of a tough one. For this, we would recommend that you download the free MySQL database. After this, create a schema e.g.
npi-20050523-20200607 /*!40100 DEFAULT CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci */;
Once you do that, just copy paste the following SQL statements to create the tables (otherwise, it is a bit of a nightmare). Create the npidata table like this (Download the txt file from here and then save it with a .sql extension)
Now that you have created the table, then, create the Other Name Reference table like this (this table contains the “other names” that a practice does business as). Download the file from here
Next, create the practice locations table (this gives you all the practice locations of a company that’s not listed as the main location. Download from here.
Finally, create the endpoint (email address etc) table. Download the file from here.
Load the data into each one of these tables like this (change it to the file names you have)
LOAD DATA INFILE 'NPPES_Data_Dissemination_June_2020/pl_pfile_20050523-20200607.csv' INTO TABLE pl FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
LOAD DATA LOCAL INFILE 'NPPES_Data_Dissemination_June_2020/endpoint_pfile_20050523-20200607.csv' INTO TABLE endpoint_pfile FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
LOAD DATA LOCAL INFILE 'NPPES_Data_Dissemination_June_2020/othername_pfile_20050523-20200607.csv' INTO TABLE othername FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
LOAD DATA LOCAL INFILE 'NPPES_Data_Dissemination_June_2020/npidata_pfile_20050523-20200607.csv' INTO TABLE npidata FIELDS TERMINATED BY ',' ENCLOSED BY '"' LINES TERMINATED BY '\n' IGNORE 1 ROWS;
How to find providers to market to
First, read this file NPPES_Data_Dissemination_Readme.pdf .. This gives you a LOT of information and helps you understand how to search the data you just imported.
Understand that if you want to market to providers directly, you will need to search for Entity_Type_Code = 1. Keep in mind that while there will be a phone number, the fax number might not be present. Also, you need to look at when the provider last updated their own CMS data. Based on when the provider last updated their data, what you are looking at, might be quite old (still does not mean that it is not valid).
Let’s say you want to find all the providers in Queens, NY. You could very well find out all the providers in NYS first like this
select * from npidata where<br>Provider_Business_Practice_Location_Address_State_Name like 'NY'<br>and Entity_Type_Code = 1<br>order by Provider_Business_Practice_Location_Address_City_Name;
Let’s see how many providers practice in NYS first
select count(*) from npidata where Provider_Business_Practice_Location_Address_State_Name like 'NY' and Entity_Type_Code = 1;
You will find that
count = 415,960
After this, you can very easily download all the results as a CSV file and work further on this.
Here’s what you would get if you were to download this as a CSV file. You can import this into google sheets and work on it further for your marketing goals.
How to find providers of a certain specialist to market to?
For this, you are going to need a little bit more data – on Taxonomy codes. Head over to this page from NUCC and download the latest CSV. After that, you can import this CSV straight away into your database OR, you can simply import this into Google Sheets (or open it in Microsoft excel).
You are going to need to filter the data a little bit for the specialist you are looking for. E.g let’s say that you are looking to market to all dental providers, you can filter and use these codes
- 125K00000X – Dental Providers, Advanced Practice Dental Therapist
- 126800000X – Dental Providers, Dental Assistant
- 124Q00000X – Dental Providers, Dental Hygienist
- 292200000X – Laboratories, Dental Laboratory
- 126900000X- Dental Providers, Dental Laboratory Technician
- 125J00000X – Dental Providers, Dental Therapist
- 122300000X – Dental Providers, Dentist
- 1223D0001X – Dental Providers, Dentist, Dental Public Health
- 1223D0004X – Dental Providers, Dentist, Dentist Anesthesiologist
- 1223E0200X – Dental Providers, Dentist, Endodontics
- 1223G0001X – Dental Providers, Dentist, General Practice
- 1223P0106X – Dental Providers, Dentist, Oral and Maxillofacial Pathology
- 1223P0221X – Dental Providers, Dentist, Pediatric Dentistry
- 1223P0300X – Dental Providers, Dentist, Periodontics
- 1223P0700X – Dental Providers, Dentist, Prosthodontics
- 1223S0112X – Dental Providers, Dentist, Oral and Maxillofacial Surgery
- 1223X0008X – Dental Providers, Dentist, Oral and Maxillofacial Radiology
- 1223X0400X – Dental Providers, Dentist, Orthodontics and Dentofacial Orthopedics
- 1223X2210X – Dental Providers, Dentist, Orofacial Pain
- 122400000X – Dental Providers, Denturist
Filter specialists you want to market to using taxonomy code
Now that you have the taxonomy codes, you can filter your provider data based on that taxonomy code. That’s how you would create a list of all providers serving a specialty to market to. Here’s a sample SQL to run:
select * from npidata where Provider_Business_Practice_Location_Address_State_Name like 'NY' and Entity_Type_Code = 1 and Healthcare_Provider_Taxonomy_Code_1 in ('125K00000X', '126800000X', '124Q00000X', '292200000X', '126900000X', '125J00000X', '122300000X', '1223D0001X', '1223D0004X', '1223E0200X', '1223G0001X', '1223P0106X', '1223P0221X', '1223P0300X', '1223P0700X', '1223S0112X', '1223X0008X', '1223X0400X', '1223X2210X','122400000X') order by Provider_Business_Practice_Location_Address_City_Name;
How to find practices/clinics to market to (not providers)
If you are marketing to hospitals, you will need Entity_Type_Code=2.So, your first step is to get the taxonomy codes from your taxonomy table. e.g
Code, Grouping, Classification, Specialization
- 273100000X, Hospital Units, Epilepsy Unit
- 275N00000X, Hospital Units, Medicare Defined Swing Bed Unit
- 273R00000X, Hospital Units, Psychiatric Unit
- 273Y00000X, Hospital Units, Rehabilitation Unit
- 276400000X, Hospital Units, Rehabilitation, Substance Use Disorder Unit
- 287300000X, Hospitals, Christian Science Sanitorium
- 281P00000X, Hospitals, Chronic Disease Hospital
- 281PC2000X, Hospitals, Chronic Disease Hospital, Children
- 282N00000X, Hospitals, General Acute Care Hospital
- 282NC0060X, Hospitals, General Acute Care Hospital, Critical Access
- 282NC2000X, Hospitals, General Acute Care Hospital, Children
- 282NR1301X, Hospitals, General Acute Care Hospital, Rural
- 282NW0100X, Hospitals, General Acute Care Hospital, Women
- 282E00000X, Hospitals, Long Term Care Hospital
- 286500000X, Hospitals, Military Hospital
- 2865C1500X, Hospitals, Military Hospital, Community Health
- 2865M2000X, Hospitals, Military Hospital, Military General Acute Care Hospital
- 2865X1600X, Hospitals, Military Hospital, Military General Acute Care Hospital. Operational (Transportable)
- 283Q00000X, Hospitals, Psychiatric Hospital
- 283X00000X, Hospitals, Rehabilitation Hospital
- 283XC2000X, Hospitals, Rehabilitation Hospital, Children
- 282J00000X, Hospitals, Religious Nonmedical Health Care Institution
- 284300000X, Hospitals, Special Hospital
Now, just run the query
select * from npidata where Entity_Type_Code = 2 and Provider_Business_Practice_Location_Address_State_Name like 'NY' and Healthcare_Provider_Taxonomy_Code_1 in ('273100000X', '275N00000X', '273R00000X', '273Y00000X', '276400000X', '287300000X', '281P00000X', '281PC2000X', '282N00000X', '282NC0060X', '282NC2000X', '282NR1301X', '282NW0100X', '282E00000X', '286500000X', '2865C1500X', '2865M2000X', '2865X1600X', '283Q00000X', '283X00000X', '283XC2000X', '282J00000X', '284300000X' );
Rinse repeat clean your data daily
Keep in mind that CMS data is notoriously bad. These steps get you started. Your mileage may vary. If you cannot afford to buy your data, keep on conducting your outreach and keep on cleaning your data daily.
PECOS NPPES Linked Data
If you are using our healthcare CRM, you will get these data already built in and can use them for your referring provider tracking.
We take all the hard work out of maintaining healthcare provider data from multiple sources. In general, we merge various data from CMS using the NPI information.
PECOS and the NPI registry
In general, PECOS is Medicare’s enrollment and revalidation system. It is the sole verified source of Medicare data.
We use the Physician Compare database as well. The underlying data source for the physician compare database is PECOS. There are a couple of things to understand here.
- PECOS is Medicare’s enrollment and revalidation system. Based on this, as you can understand, it is the primary source of information about verified Medicare professionals. I.e. professionals that can accept Medicare.
- However, PECOS is an enrollment system and it is not intended to provide up-to-the-minute information.
- For up to date information, you would need to get information from the Physician Compare website (or the database). Physician Compare, on the other hand, uses Medicare claims data to verify the PECOS information. Physician Compare information uses claims to verify group practice affiliations and practice location.
- So, while you can start with PECOS, for up to date information, you need Physician Compare information.
- Please do keep in mind that if someone has NOT billed Medicare all year long, that does not mean that they cannot be in the PECOS database.
- The provider would not even show up in Physician Compare if someone has registered recently (less than 6 months).
- Physician Compare began using Medicare claims data to verify PECOS information in mid-2013.
- A provider has to have at least one practice location address in PECOS else that would NOT show up in Physician Compare. Please do keep this in mind.
- Please do keep in mind that if someone does not have at least one specialty in PECOS, they will NOT show up in Physician Compare.
- For group practices it gets even tougher. For them to appear on Physician Compare, at least two active Medicare eligible professionals (EPs) must reassign their benefits to the group’s TIN. Unless this has been done, you will NOT get their “up to date” information on Physician Compare database.
- That’s why we recommend using BOTH these data sources.
Whose information can you find on PECOS?
- Physician (MD or DO)
- Dental practitioner
- Dental surgeon
- Podiatric practitioner
- Physician assistant (PA)
- Certified clinical nurse specialist
- Nurse practitioner
- Nurse midwife
- Clinical psychologist
- Clinical social worker
PECOS does NOT tell you if a provider has been barred from billing Medicare or not (by the HHS/OIG).