Need a Python expert familiar with large XML and DTD files, especially bulk data downloaded from the US Patent Office website.

The expert will need to parse the large XML files and vectorize the extracted text using standard techniques with NLTK or a similar library. The expert will then need to upload the vectorized data to Pinecone for later searching.
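
As a rough illustration of the parsing step, here is a minimal sketch using only the Python standard library. It assumes the bulk files are many XML documents concatenated into one file, each starting with its own `<?xml ...?>` declaration, and that records use `invention-title` and `abstract` elements. The function names `split_documents` and `extract_text` are hypothetical, and the element names should be checked against the DTD for the year being processed. Real files may also contain DTD entities that this sketch does not resolve.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Iterable, Iterator


def split_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield one XML document at a time from a concatenated bulk file.

    Assumption: every record starts with its own XML declaration line.
    """
    buffer = []
    for line in lines:
        # A new declaration marks the start of the next record.
        if line.lstrip().startswith("<?xml") and buffer:
            yield "".join(buffer)
            buffer = []
        buffer.append(line)
    if buffer:
        yield "".join(buffer)


def _flatten(element) -> str:
    """Join all nested text of an element and collapse whitespace."""
    if element is None:
        return ""
    return " ".join("".join(element.itertext()).split())


def extract_text(document: str) -> Dict[str, str]:
    """Pull the title and abstract text out of one patent record.

    Assumption: element names follow the grant XML layout
    (invention-title, abstract); adjust for other DTD versions.
    """
    root = ET.fromstring(document.encode("utf-8"))
    return {
        "title": _flatten(root.find(".//invention-title")),
        "abstract": _flatten(root.find(".//abstract")),
    }
```

Reading the file line by line keeps memory use bounded by the size of a single record, which matters for multi-gigabyte weekly files. The resulting dictionaries can then be passed to whichever vectorizer is chosen before upload.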

The data comprises 20 years of patent data.

Please use parsers that others have developed specifically for USPTO data, such as:

https://github.com/lettergram/parse-uspto-xml/tree/master/config

https://github.com/TamerKhraisha/uspto-patent-data-parser/blob/master/uspto.py

Once the vectorized data has been uploaded to Pinecone, the job may continue with fine-tuning ChatGPT models.

Hourly Range: $35.00-$70.00
Posted On: April 26, 2023 04:27 UTC
Category: Database Development
Skills: Data Migration, Redis, Python, XML, NLTK
Country: United States