Skip to content

Data pipeline to visualize data engineers salaries based on reddit posts

Notifications You must be signed in to change notification settings

yahiaakkazi/de-reddit-salary-analysis

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

7 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Reddit salary analysis

Introduction

Webapp_screen

Webapp : https://de-reddit-salary-analysis.streamlit.app/

This repository contains a collection of salary information and relevant data for the role of data engineering worldwide. The dataset is curated based on searches and submissions from the Reddit community, specifically the "Salary on the Data Engineering" subreddit. To create this dataset, I leveraged the power of OpenAI's GPT-3.5 language model using its API, and praw, the reddit api, to interact and scrape data from Reddit.

The process begins with the Python modules provided by OpenAI, allowing us to communicate with the OpenAI API and present specific prompts to the GPT-3.5 model. We then use gpt to format the raw submissions into a structured JSON format, making it easier to extract and organize the relevant salary and data engineering information, transforming data from purely unstructured data into semi-structured data.

The goal is ot, once the data is formatted, I store it in a database for further analysis and exploration. Provifing a web hosted app showcasing deep diving analysis regarding the data. This repository serves as a comprehensive resource for anyone interested in studying or analyzing data engineering market as a whole. The dataset can be utilized for various purposes.

The code is pretty straightforward, all one needs to have, is to setup an API key from openai and another one from reddit so that one can obtain a similar dataset. The project is still new and still has certain limitations. For instance, the LLM still doesn't abide strictly to the rules... The aim is to make the formatting work, and then to proceed storing data into a database, and finally presenting its insights into interactive dashboards.

About

Data pipeline to visualize data engineers salaries based on reddit posts

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages