GitHub - yahiaakkazi/de-reddit-salary-analysis: Data pipeline to visualize data engineers salaries based on reddit posts

Reddit salary analysis

Introduction

Webapp : https://de-reddit-salary-analysis.streamlit.app/

This repository contains a collection of salary information and relevant data for the role of data engineering worldwide. The dataset is curated based on searches and submissions from the Reddit community, specifically the "Salary on the Data Engineering" subreddit. To create this dataset, I leveraged the power of OpenAI's GPT-3.5 language model using its API, and praw, the reddit api, to interact and scrape data from Reddit.

The process begins with the Python modules provided by OpenAI, allowing us to communicate with the OpenAI API and present specific prompts to the GPT-3.5 model. We then use gpt to format the raw submissions into a structured JSON format, making it easier to extract and organize the relevant salary and data engineering information, transforming data from purely unstructured data into semi-structured data.

The goal is ot, once the data is formatted, I store it in a database for further analysis and exploration. Provifing a web hosted app showcasing deep diving analysis regarding the data. This repository serves as a comprehensive resource for anyone interested in studying or analyzing data engineering market as a whole. The dataset can be utilized for various purposes.

The code is pretty straightforward, all one needs to have, is to setup an API key from openai and another one from reddit so that one can obtain a similar dataset. The project is still new and still has certain limitations. For instance, the LLM still doesn't abide strictly to the rules... The aim is to make the formatting work, and then to proceed storing data into a database, and finally presenting its insights into interactive dashboards.

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
llm_open		llm_open
prawutils		prawutils
.gitignore		.gitignore
README.md		README.md
data.csv		data.csv
main.py		main.py
requirements.txt		requirements.txt
system_request.txt		system_request.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Reddit salary analysis

Introduction

About

Releases

Packages

Languages

yahiaakkazi/de-reddit-salary-analysis

Folders and files

Latest commit

History

Repository files navigation

Reddit salary analysis

Introduction

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages