|
1 | | -# github-classifier |
| 1 | +# Classifier for GitHub Repos |
2 | 2 |
|
3 | | -**short description** |
| 3 | +## Table of Contents |
| 4 | +- [Intro](#intro) |
| 5 | +- [Installation for Users](#installation-instruction-for-users) |
| 6 | +- [Installation for Devs](#installation-instruction-for-devs) |
| 7 | +- [Expectation for Devs](#expectation-for-devs) |
| 8 | +- [Known Problems / Limitations](#known-problems--limitations) |
| 9 | +- [Help](#help) |
4 | 10 |
|
5 | | -This repository contains a deep-learning based classification tool for software repositories. The tool utilizes the ecore metamodel 'type graph' and a graph convolutional network. To use the tool, run 'main.py' after adding the directory containing the repositories you want to classify. |
| 11 | +## Intro: |
6 | 12 |
|
7 | | -If you want to train the tool with different labels, replace the current labels with your own (or add them to the labels) in GraphClasses.py, and in function 'multi_hot_encoding' in Encoder.py. Optionally also in function 'count_class_elements' in CustomDataset.py if you want to know the number of samples in each class in your dataset. |
8 | | -The labels in the tool are not mutually exclusive and are multi-hot encoded. |
| 13 | +This repository features a deep learning classifier designed for the analysis of software repositories. |
| 14 | +The tool employs the ecore metamodel's 'type graph' in conjunction with a graph convolutional network. |
| 15 | +Presently, the classifier categorizes repositories into four distinct classes: Application, Framework, Library, and Plugin. |
| 16 | +It is important to note that the labels utilized by the tool are **not mutually exclusive** and are represented in a multi-hot encoded format. |
9 | 17 |
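To make the multi-hot format from the intro concrete, here is a minimal sketch. It assumes the four classes are stored in the order listed above; the `LABELS` list and the `multi_hot_encode` helper are hypothetical names used for illustration and are not taken from the repository's code.

```python
# Assumption: label order follows the README; the real encoder may use another order.
LABELS = ["Application", "Framework", "Library", "Plugin"]


def multi_hot_encode(repo_labels):
    """Return a 0/1 vector with one slot per entry in LABELS."""
    unknown = set(repo_labels) - set(LABELS)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    # A repository may carry several labels, so more than one slot can be 1.
    return [1 if label in repo_labels else 0 for label in LABELS]


# A repository labeled as both Framework and Library:
print(multi_hot_encode(["Framework", "Library"]))  # [0, 1, 1, 0]
```

Because the labels are not mutually exclusive, a vector may contain several ones, unlike a one-hot encoding, which allows exactly one.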
|
10 | | -Currently, the tool only processes Python files. |
| 18 | +## Installation Instruction for Users: |
| 19 | +1. Clone the repository by executing the following command: |
| 20 | +`git clone https://github.com/isselab/github-classifier.git` |
| 21 | +2. Open the cloned repository using your preferred Integrated Development Environment (IDE). |
| 22 | +For the purposes of this instruction, we will assume the use of PyCharm from JetBrains. |
| 23 | +3. Change the directory to data/input by running the following command: |
| 24 | +`cd data/input` |
| 25 | +4. Clone the repositories you wish to analyze by executing: |
| 26 | +`git clone LINK_TO_REPO_YOU_WANT` |
| 27 | +5. Run main.py |
11 | 28 |
|
12 | | -**labels** |
| 29 | +The default threshold for identification is set at 50%. |
| 30 | +If you wish to modify this threshold, please locate the relevant settings in the settings.py file. |
| 31 | +After making the necessary adjustments, be sure to rerun main.py to apply the changes. |
13 | 32 |
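As an illustration of how a 50% threshold might turn per-class scores into labels, consider the sketch below. It assumes the classifier outputs one score between 0 and 1 per class and that a class is reported when its score reaches the threshold; the names `THRESHOLD` and `predicted_labels` and the example scores are hypothetical and do not reflect the actual contents of settings.py.

```python
# Assumption: label order and setting name are illustrative only.
LABELS = ["Application", "Framework", "Library", "Plugin"]
THRESHOLD = 0.5  # the 50% default, written as a fraction


def predicted_labels(scores, threshold=THRESHOLD):
    """Return every label whose score is at or above the threshold."""
    if len(scores) != len(LABELS):
        raise ValueError("expected one score per label")
    return [label for label, score in zip(LABELS, scores) if score >= threshold]


example_scores = [0.91, 0.12, 0.64, 0.08]  # made-up scores for illustration
print(predicted_labels(example_scores))                 # ['Application', 'Library']
print(predicted_labels(example_scores, threshold=0.7))  # ['Application']
```

Raising the threshold reports fewer labels with higher confidence, while lowering it reports more labels at the cost of more false positives.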
|
14 | | -Application, Framework, Library, Plugin |
| 33 | +## Installation Instruction for Devs: |
15 | 34 |
|
16 | | -**data** |
| 35 | +### Basic Installation: |
| 36 | +1. Clone the repository by executing the following command: |
| 37 | +`git clone https://github.com/isselab/github-classifier.git` |
| 38 | +2. Open the cloned repository using your preferred Integrated Development Environment (IDE). |
17 | 39 |
|
18 | | -Dataset with Python software repositories from GitHub, all with a dependency on at least one ML library. |
19 | | -The labeled repositories the tool is trained with are in data/labeled_dataset_repos.xlsx. |
| 40 | +### Retraining: |
| 41 | +1. Check data/labeled_dataset_repos.xlsx. |
| 42 | +This xlsx file contains the labeled repositories the tool is trained with. |
| 43 | +You may want to change it according to your needs. |
| 44 | +2. We strongly recommend utilizing a GPU for training purposes. |
| 45 | +To verify GPU availability, please run the TorchGPUCheck.py script. |
| 46 | +If you get the result "Cuda is available!", you may proceed to step 3. |
| 47 | +If the output indicates that "Cuda is not available," please follow the instructions provided in the terminal. |
| 48 | +Additionally, refer to the guide in the [Help](#help) section for further assistance in resolving any issues. |
| 49 | +3. Run prepareDataset.py |
| 50 | +4. Change the experiment_name in settings.py in the training section. |
| 51 | +5. Run training.py |
20 | 52 |
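The GPU check in step 2 of the retraining instructions can be approximated with the sketch below. It is not the content of TorchGPUCheck.py; the `cuda_status` helper is a hypothetical name, and the only PyTorch calls used are `torch.cuda.is_available()` and `torch.cuda.device_count()`.

```python
def cuda_status():
    """Return a short message describing whether PyTorch can use a CUDA GPU."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "PyTorch is not installed."
    if torch.cuda.is_available():
        return f"Cuda is available! Visible devices: {torch.cuda.device_count()}"
    return "Cuda is not available."


print(cuda_status())
```

If the message reports that CUDA is not available, the usual causes are a CPU-only PyTorch build or a missing or outdated GPU driver.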
|
21 | | -**requirements** |
22 | 53 |
|
23 | | -pyecore~=0.14.0 or higher versions |
| 54 | +## Expectation for Devs: |
| 55 | +### Recommended Workflow: |
| 56 | +1. Create an issue on the GitHub issues page. |
| 57 | +2. Open a branch named after the issue. |
| 58 | +3. Write code that fixes the issue. |
| 59 | +4. Write test code to be sure it works. |
| 60 | +5. Comment your code well so that it can be understood. |
| 61 | +6. Create a pull request. |
24 | 62 |
|
25 | | -autopep8 |
| 63 | +## Known Problems / Limitations: |
| 64 | +- The tool only processes Python files. |
| 65 | +- The dataset contains Python software repositories from GitHub, all with a dependency on at least one ML library. |
| 66 | +- Labels cannot be changed easily (work in progress). |
26 | 67 |
|
27 | | -GRaViTY tool for visualizing the metamodels, see "https://github.com/GRaViTY-Tool/gravity-tool?tab=readme-ov-file" for instructions on how to install the tool |
| 68 | +## Help |
| 69 | +- Torch CUDA Guide, see "https://www.geeksforgeeks.org/how-to-set-up-and-run-cuda-operations-in-pytorch/" |
| 70 | +- GRaViTY tool for visualizing the metamodels, see "https://github.com/GRaViTY-Tool/gravity-tool?tab=readme-ov-file" |