
Commit

Merge pull request #26 from unoplat/12-feat-devsecops
fix: structure according to poetry
JayGhiya authored Jun 25, 2024
2 parents 4659a48 + 3acf297 commit 49fbd68
Showing 65 changed files with 6,083 additions and 56 deletions.
9 changes: 9 additions & 0 deletions .github/workflows/python_basic_check.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,9 @@
name: Develop Branch Action
on:
pull_request_target:
types:
- opened
branches:
- main
paths:
- 'codebase_understanding/**'
10 changes: 10 additions & 0 deletions .gitignore
@@ -21,3 +21,13 @@ codebase_understanding/dependencies/analysers
codebase_understanding/nodeparser/__pycache__
codebase_understanding/nodeparser/tests/__pycache__
codebase_understanding/codeagent/__pycache__
unoplat-code-confluence/codeagent/__pycache__
unoplat-code-confluence/codebaseparser/__pycache__
unoplat-code-confluence/data_models/__pycache__
unoplat-code-confluence/downloader/__pycache__
unoplat-code-confluence/loader/__pycache__
unoplat-code-confluence/nodeparser/__pycache__
unoplat-code-confluence/nodeparser/tests/__pycache__
unoplat-code-confluence/settings/__pycache__
unoplat-code-confluence/utility/__pycache__
unoplat-code-confluence/data_models/dspy/__pycache__
3 changes: 3 additions & 0 deletions README.md
100755 → 100644
@@ -1,3 +1,5 @@
<<<<<<< HEAD
=======
# Unoplat-CodeConfluence - Where Code Meets Clarity


@@ -223,3 +225,4 @@ These are the people because of which this work has been possible. Unoplat code



>>>>>>> main

This file was deleted.

22 changes: 0 additions & 22 deletions codebase_understanding/loader/parse_json.py

This file was deleted.

File renamed without changes.
216 changes: 216 additions & 0 deletions unoplat-code-confluence/README.md
@@ -0,0 +1,216 @@
# Unoplat-CodeConfluence - Where Code Meets Clarity


## The Current Problem with Repository-Level Documentation Using AI Tooling

### Process Overview:

1. Indexing Code Files: All code files are indexed into a vector database using embeddings that capture the semantic meaning of the code.
2. Query Processing: The system uses fine-tuned language models to interpret the user's query about the codebase.
3. Retrieval-Augmented Generation: The language model performs RAG to fetch relevant code snippets or documentation based on the query.
4. Reranking: Retrieved results are reranked to present the most relevant information first based on the context and specifics of the query.
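The four steps above can be sketched as a toy pipeline. This is a minimal, self-contained illustration only: bag-of-words counters stand in for real embeddings and the in-memory dict stands in for a vector database; all file names and snippets are invented for the example.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy "embedding": a bag-of-words term-frequency vector.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# 1. Index code files into an in-memory stand-in for a vector database.
corpus = {
    "order.py": "class Order represents an order with product quantity name price",
    "controller.py": "class OrderController handles http create update delete order requests",
}
index = {path: embed(src) for path, src in corpus.items()}

# 2-3. Interpret the user's query and retrieve candidate snippets (RAG).
query = embed("which class handles http requests for orders")
hits = [(path, cosine(query, vec)) for path, vec in index.items()]

# 4. Rerank: here simply sort by similarity, most relevant first.
hits.sort(key=lambda h: h[1], reverse=True)
print(hits[0][0])  # the file most relevant to the query
```

Even in this toy form, the costs are visible: every file must be embedded and kept indexed, and every query pays a full retrieval-plus-rerank pass.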

### Challenges:

1. Limited Context Windows: Most AI tools suffer from the limited context windows of large language models, which hinder their ability to process large blocks of code or extended documentation effectively.
2. Lack of Long-term Memory: These tools generally do not incorporate long-term memory, which limits their ability to remember past interactions or understand extensive codebases deeply.
3. Inefficiency: This process can be computationally expensive and slow, particularly for large codebases, due to the extensive indexing and complex querying mechanisms.
4. Cost: The operational costs can be significant because of the resources required for maintaining up-to-date embeddings and processing queries with advanced AI models.
5. Compliance and Security Issues: Storing and processing entire codebases can lead to compliance issues, especially with code that contains sensitive or proprietary information.
6. First Principles Concern: The approach may not align with first principles of software engineering, which emphasize simplicity and minimizing complexity across programming-language constructs and frameworks.
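Challenge 1 in concrete terms: when a file exceeds the model's context window, tools must chunk it, and structure that spans chunks is lost. A sketch (the 4-"token" window and the token source are illustrative only):

```python
def chunk(tokens: list[str], window: int) -> list[list[str]]:
    # Split a token stream into fixed-size chunks; each chunk is all the
    # context the model sees at once, so cross-chunk references are lost.
    return [tokens[i:i + window] for i in range(0, len(tokens), window)]

source_tokens = "class Order : field product_quantity ; field product_price ;".split()
chunks = chunk(source_tokens, window=4)
# The class declaration and its fields land in different chunks,
# so no single model call sees the whole class.
print(len(chunks))
```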

### Mermaid Diagram of the Process:
Here's a visual representation of the process using a Mermaid diagram:

```mermaid
graph LR
A[Start] --> B[Index Code Files]
B --> C[Process Query]
C --> D[Retrieve Relevant Data]
D --> E[Rerank Results]
E --> F[Present Results]
F --> G[End]
```
This diagram helps visualize the workflow from the start of the query to the presentation of results, illustrating the steps where inefficiencies and complexities arise.

### Unoplat Solution to all of these problems

#### Unoplat Solution: Deterministic Information Ingestion for Enhanced Code Understanding
The Unoplat approach marks a significant shift from conventional AI-powered tools by opting for a deterministic method of managing and understanding codebases. Here’s an overview of how Unoplat proposes to resolve the inefficiencies of current AI-powered code assistance tools:

#### Process Overview:

1. Language-Agnostic Parsing: Unoplat uses a language-agnostic parser, similar to generic compilers, to analyze and interpret any programming language or framework. This step involves no AI, focusing solely on deterministic parsing methods.
2. Generating Semi-Structured JSON: From the parsing step, Unoplat generates semi-structured JSON data. This JSON captures essential constructs and elements of the programming languages being analyzed, providing a clear, structured view of the codebase without reliance on AI for code understanding.
3. Enhancing Metadata: The semi-structured JSON is then enriched with additional metadata in a single attribute, with the help of an OSS instruct model.
4. Integration with Open Source LLMs: Leveraging open-source large language models (LLMs), Unoplat combines the enriched metadata with multi-agentic workflows. This integration aims to produce a more sophisticated and useful "Code Atlas," which developers can use to navigate and understand large and complex codebases more effectively.
5. Output: The output is a highly detailed, easily navigable representation of the codebase, allowing developers to understand and modify code with much higher accuracy and speed than traditional AI-based tools.
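Steps 1-2 can be sketched with Python's stdlib `ast` module standing in for the language-agnostic parser (the real pipeline uses Chapi/ArchGuard, and the output shape below is illustrative, not the actual Unoplat schema):

```python
import ast
import json

source = """
class Order:
    def total(self):
        return self.quantity * self.price
"""

def parse_to_semi_structured_json(code: str) -> list[dict]:
    # Deterministic parsing: no AI involved, just an AST walk that
    # captures the essential constructs (classes and their functions).
    tree = ast.parse(code)
    nodes = []
    for node in ast.walk(tree):
        if isinstance(node, ast.ClassDef):
            nodes.append({
                "type": "CLASS",
                "name": node.name,
                "functions": [f.name for f in node.body if isinstance(f, ast.FunctionDef)],
            })
    return nodes

metadata = parse_to_semi_structured_json(source)
print(json.dumps(metadata))
```

Because the parse is deterministic, the same source always yields the same JSON, which is what makes the downstream enrichment auditable.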

#### Benefits:
1. Deterministic and Transparent: The deterministic nature of the process ensures transparency and reliability in how code is analyzed and understood.
2. Cost-Effective: Reduces the dependency on expensive AI models and the associated computational and maintenance costs.
3. Compliance and Security: By not relying on AI models trained on external data, Unoplat minimizes potential compliance and security issues.
4. Scalability: The approach is highly scalable, as it can handle any programming language or framework without needing specific model training.
#### Mermaid Diagram of the Process:
Here’s a visual representation using a Mermaid diagram to illustrate the Unoplat process:

```mermaid
graph TD
A[Start] --> B[Language-Agnostic Parsing]
B --> C[Generate Semi-Structured JSON]
C --> D[Enhance Metadata]
D --> E[Integrate with Open Source LLMs]
E --> F[Generate Enhanced Code Atlas]
F --> G[End]
```
This diagram outlines the Unoplat process from the initial parsing of the codebase to the generation of an enhanced Code Atlas, highlighting the deterministic and structured approach to managing and understanding codebases.

## Unoplat Solution to the Current Problem
```mermaid
flowchart TD
Start(Unoplat GUI/Terminal Experience)
Parse[Parse Java Codebase]
CHAP[Common Hierarchical Abstract Parser]
IC[Information Converter]
Archguard[Archguard]
Output[Semi-structured JSON - Class Metadata]
Litellm[litellm using SOTA OSS LLM for reasoning - phi3-14b-instruct]
Finer_Summary[Finer Summary per Class]
CrewAI_Manager_Agent[Manager_Crewai]
Data_Engineer_Agent[Data Engineer - Job is to provide per class unoplat markdown spec]
Unoplat_Custom_Tool[Custom Tool - Fetch Finer summary one at a time until end of items using long term memory]
Software_Engineer_Agent[Software_Engineer_CrewAi - Clean up markdown]
Senior_Software_Engineer_Agent[Senior Software Engineer- Adjust/Modify overall summary based on current summary]
Senior_Markdown_Technical_Documentation_Specialist[Unoplat Markdown tech doc specialist- Analyze the evolving summary for accuracy and insights based on all available classes' metadata and include flow/interactions within the codebases between classes.]
MarkdDownOutput[Markdown Output]
Start --> Parse
Parse --> CHAP & IC & Archguard
CHAP & IC & Archguard --> Output
Output --> Litellm
Litellm --> Finer_Summary
Finer_Summary --> CrewAI_Manager_Agent
CrewAI_Manager_Agent --> Data_Engineer_Agent
Data_Engineer_Agent --> Unoplat_Custom_Tool
Unoplat_Custom_Tool --> Software_Engineer_Agent
Software_Engineer_Agent --> Senior_Software_Engineer_Agent
Senior_Software_Engineer_Agent --> Senior_Markdown_Technical_Documentation_Specialist
Senior_Markdown_Technical_Documentation_Specialist --> MarkdDownOutput
```
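The "Custom Tool" node in the flowchart fetches one per-class finer summary at a time until the list is exhausted, remembering its position between agent calls. A stateful iterator sketch (class and method names are assumptions for illustration, not the actual CrewAI tool API):

```python
class FinerSummaryFetcher:
    """Stand-in for the custom tool: yields one per-class summary per
    call, keeping its position as simple long-term memory."""

    def __init__(self, summaries: list[str]):
        self._summaries = summaries
        self._cursor = 0  # position remembered across agent calls

    def fetch_next(self):
        if self._cursor >= len(self._summaries):
            return None  # signals the agents that all classes are processed
        summary = self._summaries[self._cursor]
        self._cursor += 1
        return summary

fetcher = FinerSummaryFetcher(["Order: data model", "OrderController: REST endpoints"])
first = fetcher.fetch_next()
print(first)
```

Returning `None` at exhaustion gives the agent workflow an explicit stopping condition, which relates to blocker 2 below (the workflow not exiting).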



## Example:

### Input:
```
Local workspace from https://github.com/DataStax-Examples/spring-data-starter.git
```

### Output:
```
# Order Class
- **Package**: `com.datastax.examples.order`
- **File Path**: `src/main/java/com/datastax/examples/order/Order.java`
- **Responsibility**: This class represents an order in the system, encapsulating all necessary details such as product quantity, name, price, and added-to-order timestamp.
## Fields
Each field corresponds to a column of our Cassandra database table. The annotations indicate how each Java data type is mapped to its respective datatype in Cassandra.
- **OrderPrimaryKey**: `None`
- **Type**: This field represents the unique identifier for an order within the system, serving as the primary key. No dependencies are injected here.
- **Integer**
- **type_key**: `productQuantity`
- **Type**: Represents the quantity of a product in this particular order. Annotated with `@Column("product_quantity")` and `@CassandraType(type = CassandraType.Name.INT)` to map it properly within our database structure. No dependencies are injected here.
- **String**
- **type_key**: `productName`
- **Type**: Stores the name of a product in this order. Annotated with `@Column("product_name")` and `@CassandraType(type = CassandraType.Name.TEXT)` to ensure accurate representation in our database schema. No dependencies are injected here.
- **Float**
- **type_key**: `productPrice`
- **Type**: Contains the price of a product within this order. Annotated with `@CassandraType(type = CassandraType.Name.DECIMAL)` to map it correctly in our database system. No dependencies are injected here.
- **Instant**
- **type_key**: `addedToOrderTimestamp`
- **Type**: Stores the timestamp of when this order was added to the system. Annotated with `@CassandraType(type = CassandraType.Name.TIMESTAMP)` for accurate mapping in our database schema. No dependencies are injected here.
# OrderController Class Summary
## Package
com.datastax.examples.order
## File Path
src/main/java/com.datastax.examples/order/OrderController.java
## Fields
- **OrderRepository** (private OrderRepository orderRepository)
- **Type**: The field is an instance of the OrderRepository class, which contains methods for accessing and manipulating data from a database using JPA or similar technologies. It serves as a dependency injection to enable interaction with the repository layer inside the controller's methods.
## Methods
- **root()**: `ModelAndView`
- **Summary**: Returns a ModelAndView object representing the root page of the order management system, which typically includes links or navigation elements for other pages/actions within the application.
- **createOrder(Request req, Response res)**: `Order`
- **Summary**: Processes an HTTP POST request to create a new Order object with data from the client's input and saves it in the database using the repository layer. It then returns the created order as the response payload.
- **updateOrder(UUID id, Request req, Response res)**: `Order`
- **Summary**: Processes an HTTP PUT or PATCH request to update a specific Order object identified by its UUID with new data from the client's input and saves it in the database using the repository layer. It then returns the updated order as the response payload.
- **deleteOrder(UUID id)**: `void`
- **Summary**: Processes an HTTP DELETE request to remove a specific Order object identified by its UUID from the database and handles any related cleanup or cascading deletions using the repository layer. It does not return any response payload.
# OrderPrimaryKey Class Summary
## Package
`com.datastax.examples.order`
## File Path
`src/main/java/com/datastax/examples/order/OrderPrimaryKey.java`
## Responsibility
This class represents the primary key for an Order entity, containing UUID fields to uniquely identify each order and its associated product within a Cassandra database.
## Fields
- **UUID**: `orderId`
- **Type**: Represents the unique identifier of the order itself. It is marked with `@PrimaryKeyColumn(name = "order_id", ordinal = 0, type = PrimaryKeyType.PARTITIONED)` to denote its role as a partition key in Cassandra's primary key structure.
- **UUID**: `productId`
- **Type**: Represents the unique identifier of the product associated with the order. This is also marked with `@PrimaryKeyColumn(name = "product_id", ordinal = 1, type = PrimaryKeyType.CLUSTERED)` indicating that it serves as a clustering key in Cassandra's primary key scheme, which further refines the data retrieval within each partition identified by `orderId`.
(No methods are defined for this class.)
```
## Current Stage

### Status: Alpha
### Blockers before user adoption:
1. Performance issue with per-class summary [DSPy-based pipelines in progress with fine-tuned data]
2. Multi-agent workflow not exiting, due to potential enhancements needed in our CrewAI implementation.
3. Moving from TUI to CLI. [DONE]

## Tech Stack

1. [Chapi](https://chapi.phodal.com/)
2. [PyTermGui](https://ptg.bczsalba.com/)
3. [Litellm](https://docs.litellm.ai/docs/)
4. [ArchGuard](https://github.com/archguard/archguard)
5. [CrewAi](https://www.crewai.com/)
6. [loguru](https://loguru.readthedocs.io/en/stable/api/logger.html)
7. [PyTest](https://pytest.org/)
8. [Pydantic](https://www.pydantic.dev)
9. [DSPy](https://github.com/stanfordnlp/dspy)


## Credits/heroes/supporters

These are the people because of whom this work has been possible. Unoplat code confluence would not exist without them.
1. [Phodal from Chapi and ArchGuard](https://github.com/phodal)
2. [Ishaan & Krrish from Litellm]([email protected] / [email protected])
3. [Joao Moura from crewai](https://github.com/joaomdmoura)
4. [Vipin Shreyas Kumar](https://github.com/vipinshreyaskumar)
5. [Apeksha](https://github.com/apekshamehta)
Empty file.
@@ -6,6 +6,7 @@
import datetime
from codebaseparser.ArchGuardHandler import ArchGuardHandler
import re
from data_models.chapi_unoplat_codebase import UnoplatCodebase
from downloader.downloader import Downloader
from loader import iload_json, iparse_json
from loader.json_loader import JsonLoader
@@ -16,8 +17,8 @@


def main(iload_json, iparse_json,isummariser,json_configuration_data):
settings = AppSettings()
get_codebase_metadata(json_configuration_data,settings,iload_json,iparse_json,isummariser)
#settings = AppSettings()
get_codebase_metadata(json_configuration_data,iload_json,iparse_json,isummariser)


def handle_toggle(value):
@@ -26,7 +27,7 @@ def handle_toggle(value):
logger.info(f"Selected language: {value}")


def get_codebase_metadata(json_configuration_data,settings,iload_json,iparse_json,isummariser):
def get_codebase_metadata(json_configuration_data,iload_json,iparse_json,isummariser):
# Collect necessary inputs from the user to set up the codebase indexing
local_workspace_path = json_configuration_data["local_workspace_path"]
programming_language = json_configuration_data["programming_language"]
@@ -44,7 +45,6 @@ def get_codebase_metadata(json_configuration_data,settings,iload_json,iparse_jso
programming_language,
output_path_field,
codebase_name_field,
settings,
github_token,
arcguard_cli_repo,
local_download_directory,
@@ -83,7 +83,7 @@ def ensure_jar_downloaded(github_token,arcguard_cli_repo,local_download_director

return jar_path

def start_parsing(local_workspace_path, programming_language, output_path, codebase_name, settings, github_token, arcguard_cli_repo, local_download_directory, iload_json, iparse_json, isummariser):
def start_parsing(local_workspace_path, programming_language, output_path, codebase_name, github_token, arcguard_cli_repo, local_download_directory, iload_json, iparse_json, isummariser):

# Log the start of the parsing process
logger.info("Starting parsing process...")
@@ -108,15 +108,20 @@ def start_parsing(local_workspace_path, programming_language, output_path, codeb
chapi_metadata_path = archguard_handler.run_scan()

chapi_metadata = iload_json.load_json_from_file(chapi_metadata_path)


current_timestamp = datetime.datetime.now().strftime("%Y%m%d%H%M%S")

output_filename = f"{codebase_name}_{current_timestamp}.md"

with open(os.path.join(output_path, output_filename), 'a+') as md_file:
for node in iparse_json.parse_json_to_nodes(chapi_metadata, isummariser):
if node.type == "CLASS":
md_file.write(f"{node.summary}\n\n")
unoplat_codebase : UnoplatCodebase = iparse_json.parse_json_to_nodes(chapi_metadata, isummariser)

print(unoplat_codebase.model_dump())

# with open(os.path.join(output_path, output_filename), 'a+') as md_file:
# for node in iparse_json.parse_json_to_nodes(chapi_metadata, isummariser):
# if node.type == "CLASS":
# md_file.write(f"{node.summary}\n\n")
# with open('codebase_summary.json', 'w') as file:
# json.dump(codebase_metadata, file)

@@ -135,6 +140,7 @@ def start_parsing(local_workspace_path, programming_language, output_path, codeb
isummariser = NodeSummariser()
#loading the config
json_configuration_data = iload_json.load_json_from_file(args.config)
print(json_configuration_data)

#loading and setting the logging config
logging_config = iload_json.load_json_from_file("loguru.json")
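The refactor in the diff above makes `start_parsing` build a typed `UnoplatCodebase` object and dump it via `model_dump()`. A rough sketch of that pattern, using stdlib dataclasses as a stand-in for the Pydantic models the project actually uses (all field names besides `UnoplatCodebase` are assumptions):

```python
from dataclasses import dataclass, field, asdict

@dataclass
class UnoplatNode:
    name: str
    node_type: str
    summary: str = ""

@dataclass
class UnoplatCodebase:
    name: str
    nodes: list = field(default_factory=list)

    def model_dump(self) -> dict:
        # Pydantic's model_dump() serializes the model tree to plain
        # dicts; dataclasses.asdict produces the same shape here.
        return asdict(self)

codebase = UnoplatCodebase("spring-data-starter", [UnoplatNode("Order", "CLASS")])
dump = codebase.model_dump()
print(dump["nodes"][0]["name"])
```

Moving from ad-hoc markdown writes to a typed model dump like this makes the intermediate representation inspectable before any summarisation runs.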
