Updated Aug 16 2024

Data Engineer 💿 | 2x GSoC @ MetaBrainz ☀️ | Open Source 🧑‍💻 | Python Dev 🐍 | IIT Madras 🧠

📧 Email | 👔 LinkedIn | 🐦 Twitter | 💻 GitHub


💫 About Me

Hello, I’m Prathamesh, a seasoned Data Engineer & Analyst with a knack for turning raw data into actionable insights. With 2+ years of experience, and expertise in Data Extraction, Data Integration, Web Scraping, ETL, and Data Analytics, I’ve slashed costs by thousands of $$$ for my clients in Open Source and Corporate domains.

Back in 4th grade, I started tinkering with computers and developed a deep passion for technology, which became a highlight of most of my teens. Six years ago, I developed a similar passion for making and consuming insane amounts of music. With 3+ years of experience in Music Production, Piano, and Audio Engineering under my artist alias “SNÆK”, plus a lifelong love for computers, my passion for music and technology has now convolved into a passion for Data and AI involving various audio technologies!

With proven work experience at competitive programs like Google Summer of Code, a record of delivering excellent results for my freelance clients, and a background in Artificial Intelligence, I bring a blend of expertise to leverage data and drive results.


🎯 Skills:

  • Domain: Data Engineering, Data Analytics, Python Development, API Scraping, AI Engineering
  • Languages: Python, SQL, Spark SQL, SPARQL, Shell Scripting
  • Data Processing: Pandas, Apache Spark (PySpark), NumPy | Airflow, Kafka, dbt, Databricks DLT Pipelines, Cron
  • Data Integration: PostgreSQL, BigQuery, Delta Lake, Apache Hive | Beautiful Soup, Scrapy, Requests, Regex
  • Cloud & DevOps: Git, Linux, Docker, CI/CD | Databricks, Azure (App Service, Azure Linux VM Compute)
  • Misc: Tableau, Matplotlib, Streamlit, Flask, HTML, CSS | Scikit-Learn, TensorFlow, llama-cpp, Hugging Face
  • Course Work: DBMS, Big Data Computing, Business Data Management, Business Intelligence
  • Soft Skills: Leadership, Team Management, Creative Writing, Good Sense of Humor.
  • Audio/Music Production & Mixing: FL Studio, Pro Tools, Audacity.
  • Design/Editing: Adobe Photoshop, DaVinci Resolve, Canva.
  • Spoken Languages: English, Marathi, Hindi (Proficient) | Japanese (Elementary), Russian (Elementary).
  • Interests: Technology, Music Production, Playing Piano, Reading, Cats.

🔬 Work Experience

Data Engineer Consultant | Gekko [Apr '24 - Current]

  • Led the design and development of a deterministic NLP solution to extract KPIs and publish reports from daily logs for a major drilling intelligence firm in the UK.

  • Achieved over 98% extraction accuracy from extremely noisy text data, and generalized it over 15+ unique oil wells.

  • Optimized RegEx and Pandas code, boosting runtime speed by 30%. Enforced clean and efficient OOP design in Python.

  • Tech Stack: Python (RegEx, Pandas, Plotly), Linux, Excel

Data Engineer Intern | New Engen Inc. [Dec '23 - Mar '24]

  • Assisted with Data Modelling and implementing Data Pipelines to Extract, Load, and Transform raw Facebook Ads data for Business Intelligence and Data Analytics teams at New Engen.

  • Migrated complex pipelines from Salesforce Datorama and Adverity to a custom in-house solution using GCP BigQuery, dbt, Apache Airflow, and Python.

  • Tech Stack: Python, dbt, Google Cloud Platform, BigQuery, Airflow, Adverity, Salesforce Intelligence (Datorama), Git, etc.

Data Engineering & Analytics Contributor | Google Summer of Code 2023 @ MetaBrainz [May '23 - Nov '23]

  • Assembled an ETL pipeline from Wikidata to the MusicBrainz database, facilitating a 60% increase in new location data, and slashing manual data feeding by 90%. [link]

  • Independently designed and developed a scalable, production-ready solution using Python (pandas, multiprocessing, requests, sqlalchemy), SQL (PostgreSQL), and Docker. [architecture] [code]

  • Conducted extensive research and experimentation, optimizing SPARQL queries to cater to Wikidata’s graph data structure to significantly improve data quality and extraction efficiency. [details]

  • Tech Stack: Python, SQL (PostgreSQL), SPARQL, Pandas, Git, Docker, Shell Scripting.

  • Domain: Data Engineering, Data Analytics.

Data Engineering & Analytics Contributor | Google Summer of Code 2022 @ MetaBrainz [May '22 - Oct '22]

  • Enriched, cleaned, and combined 27 billion rows of music streaming data using Python (Pandas, Multiprocessing), SQL (PostgreSQL), and Apache Arrow – achieving high efficiency in Python without Spark. [details]

  • Researched, experimented with, and implemented cutting-edge technologies like Zstandard and Apache Arrow to optimize data lake efficiency, resulting in a 53% reduction in storage and a 9% improvement in read/write speeds. [details]

  • Performed Data Analytics and published Benchmarks, Dashboards, and Reports to help the collaborating teams better understand and utilize the data to train state-of-the-art Music Recommendation Systems.

  • Tech Stack: Python, SQL (PostgreSQL), Pandas, Apache Arrow, Matplotlib, Git, Shell Scripting.

  • Domain: Data Engineering, Data Analytics.

  • Project Summary: https://blog.metabrainz.org/?p=9785

Teaching Assistant (Machine Learning) | Dept. of Artificial Intelligence, GHRCEM Pune [Mar '22 - May '23]

  • Conducted hands-on Machine Learning, Data Processing, and Data Visualization sessions for 70+ sophomore students at the Dept. of AI (GHRCEM).
  • Introduced students to Machine Learning concepts like Linear Regression, Naive Bayes (incl. Text Classification), KNN, and Support Vector Machines, etc.
  • Tech Stack: Python, scikit-learn, Pandas, NumPy, Matplotlib, and Seaborn.
  • Domain: Machine Learning.

🏫 Education:

B.S. Data Science and Applications | Indian Institute of Technology, Madras [2021 - 2025]

  • An off-campus 4-year degree program in Applied Data Science.
  • CGPA: 8.24

B.Tech. Artificial Intelligence | G.H. Raisoni College of Engineering & Management, Pune [2020 - 2024]

  • An on-campus undergraduate 4-year degree program in Artificial Intelligence & Computer Science.
  • CGPA: 8.88

🏗️ Projects:

Freelance Project 1 | [proprietary]

  • Scalable, fault-tolerant data pipeline to scrape BSE/NSE notices, extract intelligence using Google Gemini, and publish email updates.
  • Written in Python and orchestrated with Cron. Containerized and deployed on the client's VPS. This project was very heavy on Data Scraping and Cleaning using Beautiful Soup, PyPDF, and RegEx.
  • Tech Stack: Python, SQL (PostgreSQL), Cron, Linux, RegEx, Beautiful Soup
  • Skills Used: Data Scraping, ETL, DevOps

Freelance Project 2 | [proprietary]

  • Deterministic NLP solution to incrementally extract KPIs and publish reports from daily logs for a major drilling intelligence firm in the UK.
  • Long-term project heavy on Python Development. Maintains outstanding code quality with intricate OOP design, reliable tests, and containerized deployment.
  • Tech Stack: Python, Pandas, RegEx (heavy usage), Excel
  • Skills Used: Data Scraping, ETL, DevOps

lastfm-scraper | 🔗 Codebase

  • Lastfm-scraper is a simple platform to scrape, clean, analyze, and download your music listening history from last.fm for analytics and machine learning applications. Last.fm is a music service that tracks and organizes a user's listening history across multiple devices and streaming services.
  • Implemented in Python, the project pulls user listening data from the last.fm API, processes it with Pandas, and delivers it in accessible formats like CSV and JSON. It is hosted on Azure App Service and deployed through a CI/CD pipeline built with GitHub Actions.
  • Tech Stack: Python, Flask, Pandas, Git, MS Azure, HTML/CSS/JS, REST APIs, GitHub Actions.
  • Skills used: Data Wrangling, Data Cleaning, Cloud, DevOps

Document Topic Modelling | 🔗 Codebase

  • A simple interactive command-line utility to classify text into pre-defined topics using Machine Learning (NLP). This project is based on the LDA (Latent Dirichlet Allocation) model, and built using Python, Scikit-learn, and Gensim.
  • Tech Stack: Python, Git, Scikit-learn, Gensim, Rich.
  • Skills used: Machine Learning, Natural Language Processing.
  • A simple and elegant Tableau dashboard that visualizes my monthly financial spending habits. For this project, I fetched data from my personal spreadsheet-based budget tracking system hosted on Notion.
  • Tech Stack: Tableau, MS Excel.
  • Skills used: Data Visualization, Dashboarding.

Portfolio Site | 🔗 Project Demo | 🔗 Codebase

  • This portfolio site was custom-built with clean looks and minimalism in mind. I used Hugo, a static site generator, to write the site contents in Markdown for better, distraction-free maintenance. Site rendering and hosting are automated with a simple CI/CD pipeline built using GitHub Actions, and the site is hosted on GitHub Pages.
  • Tech Stack: Hugo, Git, GitHub Actions.
  • Skills Used: Web Development, CI/CD.

🏆 Achievements

  • Amongst 967 globally selected candidates out of 43,765 applicants for Google Summer of Code 2022 and 2023.
  • Delivered multiple talks at IIT Madras, engaging and educating 3,000+ students about the benefits of Open Source.
  • Elected as President of the Student’s Association of Artificial Intelligence, GHRCEM Pune.
  • Elected as Vice-President of the IEEE Student’s Chapter, GHRCEM Pune.
  • Wrote a blog with 35k+ LinkedIn impressions and 3.7k+ views.
  • Organized multiple college events with 200+ attendees each, achieving an average event rating of 4.59/5.00.
  • Represented the “Hadar Cluster” (South East Asia) at IEEE Asia Pacific’s CLAP (2021) program.

🤝 Leadership / Extracurriculars

Teaching Assistant (Machine Learning) | Dept. of Artificial Intelligence, GHRCEM Pune [Mar '22 - May '23]

  • Conducted hands-on Machine Learning, Data Processing, and Data Visualization sessions for 70+ sophomore students at the Dept. of AI (GHRCEM).
  • Introduced students to Machine Learning concepts like Linear Regression, Naive Bayes (incl. Text Classification), KNN, and Support Vector Machines, etc. using Python, scikit-learn, Pandas, NumPy, Matplotlib, and Seaborn.

President | Student’s Association of Artificial Intelligence, GHRCEM, Pune [Nov '21 - Mar '22]

  • Operated Human Resources, Planning, and Execution for all events at the Department of AI, GHRCEM, Pune.
  • Hosted events and workshops like “Tech Talks 1.0: Biostatistics w/ Mr. Shariq Mohammed, Boston University”, and “YOU 2.0: The complete personality upliftment program” with 200+ attendees and 4.59/5.00 average event ratings.

Co-Chair | IEEE Student’s Chapter, GHRCEM Pune [Mar '21 - Feb '23]

  • Hosted several flagship events at the IEEE Pune Section, such as IEEE CODE-STROM [2022] and the EAC-funded Cloud and Data Engineering Workshop [2022].
  • Represented the “Hadar Cluster” (South East Asia) at IEEE Asia Pacific’s CLAP [2021] program.

Speaker and Project Lead | Ek Bharat Shrestha Bharat Club, GHRCEM, Pune [Sep '21 - Nov '21]

  • Designed and delivered 5+ inter-state presentations to Aryan Institute of Technology, Bhubaneshwar, Odisha, while representing GH Raisoni College of Engineering and Management, Pune, Maharashtra.

Vice President | Music Club, GH Raisoni College of Engineering & Management, Pune [Aug '21 - Nov '21]

  • Operated Human Resources, Planning, and execution for 6+ introductory and jamming sessions.

⌨️ Blogs:

📜 Certificates:


🎹 Hobbies

  • In my free time, I like playing Piano 🎹, and Producing Music 🎧 under my artist alias SNÆK.
  • I also LOVE listening to music whenever I can! Check out my streams here: last.fm - snaekboi
  • Been trying to get into books as well! My favorites are “The Subtle Art of Not Giving a F*ck” by Mark Manson, and “Tokyo Ghoul” by Sui Ishida.