chore: add evaluation result in json format and UI section

This commit is contained in:
Sosokker 2025-05-14 02:54:16 +07:00
parent 131834544e
commit 951695108b
7 changed files with 194 additions and 5 deletions


@@ -56,22 +56,50 @@ For validation, we separate into two parts
During fine-tuning, if we provide evaluation data, Vertex AI will calculate the metrics for us.
![validation-metrics-1](assets/model-2-metrics.png)
![validation-metrics-2](assets/model-3-metrics.png)
##### Post-Fine-Tuning Evaluation
We use two methods (a minimal code sketch follows the list):
1. JSON Syntactic Validity: parse the generated JSON string with `json.loads()`.
2. Pydantic Schema Conformance: if the generated output is valid JSON, try to instantiate the `CanonicalRecord` Pydantic model with the parsed dictionary: `CanonicalRecord(**parsed_generated_json)`.
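A minimal sketch of these two checks, assuming `CanonicalRecord` is importable from the project (the `from models import CanonicalRecord` path is a placeholder; the actual `evaluate.py` may differ):
```python
import json

from pydantic import ValidationError

# Placeholder import path: the real location of CanonicalRecord depends on the repo layout.
from models import CanonicalRecord


def check_output(generated: str) -> tuple[bool, bool]:
    """Return (json_valid, schema_valid) for one generated string."""
    # 1. JSON Syntactic Validity: does the string parse at all?
    try:
        parsed = json.loads(generated)
    except json.JSONDecodeError:
        return False, False

    # 2. Pydantic Schema Conformance: does the parsed dict satisfy CanonicalRecord?
    try:
        CanonicalRecord(**parsed)
    except (ValidationError, TypeError):
        return True, False
    return True, True
```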
To calculate the metrics, I run the following commands:
```bash
uv sync
gcloud auth application-default login # This is required to authenticate with my account
uv run evaluate.py
```
All models are evaluated with the same settings.
![evaluation](assets/model-setting.png)
#### Results analysis
Here are the results:
```markdown
# JSON Syntactic Validity:
CustomModel.BORBANN_PIPELINE_2: 91.67%
CustomModel.BORBANN_PIPELINE_3: 100.00%
CustomModel.BORBANN_PIPELINE_4: 100.00%
# Pydantic Schema Conformance (CanonicalRecord Validation Rate):
CustomModel.BORBANN_PIPELINE_2: 63.64%
CustomModel.BORBANN_PIPELINE_3: 0.00%
CustomModel.BORBANN_PIPELINE_4: 0.00%
```
We can see that `borbann-pipeline-2` has the worst JSON Syntactic Validity but the best Pydantic Schema Conformance, while `borbann-pipeline-3` and `borbann-pipeline-4` reach perfect JSON Syntactic Validity but zero Pydantic Schema Conformance.
We pick `borbann-pipeline-2` as the final model to deploy according to the evaluation result, since schema conformance is what downstream parsing depends on.
This may be because the prompt we use for fine-tuning does not cover all the cases and encodes a wrong schema; as we put more data into the model, it fits that wrong output schema more closely.
For the numerical details, see the [`evaluation_results.json`](evaluation_results.json) file.
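Comparing the percentages above with the raw counts, the JSON validity rate appears to be taken over all 12 evaluation samples, while the Pydantic conformance rate appears to be taken over the syntactically valid outputs only (7/11 ≈ 63.64% for `borbann-pipeline-2`). A small sketch of that calculation, based on my reading of the numbers rather than the actual `evaluate.py` code:
```python
import json

NUM_SAMPLES = 12  # size of the evaluation set used above

with open("evaluation_results.json") as f:
    results = json.load(f)

for endpoint, json_valid in results["json_validity_count"].items():
    pydantic_valid = results["pydantic_validity_count"][endpoint]
    json_rate = json_valid / NUM_SAMPLES
    # Conformance rate over syntactically valid outputs only, e.g. 7 / 11 = 63.64%.
    schema_rate = pydantic_valid / json_valid if json_valid else 0.0
    print(f"{endpoint}: JSON {json_rate:.2%}, schema {schema_rate:.2%}")
```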
### Task 1.4: Model Versioning and Experimentation
@@ -79,9 +107,46 @@ Instead of MLFlow, Vertex AI platform provide model versioning and experimentati
![model-version](assets/vertex/model-versioning.png)
We have three versions of the model:
- `borbann-pipeline-2`: 1 sample (2207 tokens)
- `borbann-pipeline-3`: 19 samples (33320 tokens) + evaluation samples (12 samples)
- `borbann-pipeline-4`: 25 samples (43443 tokens) + evaluation samples (12 samples)
The versions differ in the amount of training data but share the same model settings, such as temperature and maximum output length.
### Task 1.5 + 1.6: Model Explainability + Prediction Reasoning
For model explainability and prediction reasoning, we follow the `Traceable Prompting / Chain-of-Thought (CoT) Prompting` method. In this case, I use this prompt:
```markdown
Explain how to generate output in a format that can be easily parsed by downstream
systems in "reasoning steps" key then output the canonical record.
```
To generate the explanation, I run the following commands:
```bash
uv sync
uv run explainability.py
```
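For reference, a minimal sketch of what a script like `explainability.py` might do, assuming the tuned model is reachable through its Vertex AI endpoint via the `vertexai` SDK; the project and endpoint values are placeholders and the real script may differ:
```python
import vertexai
from vertexai.generative_models import GenerativeModel

COT_PROMPT = (
    'Explain how to generate output in a format that can be easily parsed by downstream '
    'systems in "reasoning steps" key then output the canonical record.'
)

# Placeholder project/endpoint values; substitute the deployed tuned-model endpoint.
vertexai.init(project="<PROJECT_ID>", location="us-central1")
model = GenerativeModel("projects/<PROJECT_NUMBER>/locations/us-central1/endpoints/<ENDPOINT_ID>")

raw_record = "..."  # raw pipeline data to map to a canonical record
response = model.generate_content(f"{COT_PROMPT}\n\n{raw_record}")
print(response.text)  # reasoning steps followed by the canonical record
```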
The model explains the intuition behind its decisions. Here is a portion of the output:
```markdown
* **reasoning steps:** A list of dictionaries. Each dictionary represents a reasoning step. Each dictionary has keys like:
* **step_number:** The numerical order of the step.
* **description:** A natural language description of the step (e.g., "Applying rule X to deduce Y").
* **input:** The input to the reasoning step (e.g., facts, observations).
* **output:** The output of the reasoning step (e.g., new facts, conclusions).
* **rule:** (Optional) The rule applied in the step.
* **canonical record:** A structured representation of the canonical information. The structure depends on the input type and the task. General considerations:
* **Entities:** Represent entities with unique identifiers (UUIDs recommended). Include attributes like name, type, and other relevant details.
* **Relationships:** Represent relationships between entities using predicates (e.g., "works_for," "located_in"). Include attributes like start date, end date, etc.
* **Events:** Represent events with unique identifiers (UUIDs recommended). Include attributes like event type, participants (linked entities), location, time, and other relevant details.
```
For the full output, see the [`explainability.json`](explainability.json) file.
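For reference, a minimal Pydantic sketch of the trace structure the model describes above; the field names follow the excerpt, and this model is illustrative rather than part of the repository:
```python
from typing import Any, Optional

from pydantic import BaseModel, Field


class ReasoningStep(BaseModel):
    step_number: int
    description: str
    input: Optional[Any] = None   # facts/observations fed into the step
    output: Optional[Any] = None  # new facts or conclusions produced
    rule: Optional[str] = None    # rule applied in the step, if any


class ReasoningTrace(BaseModel):
    # Aliases match the keys the model emits: "reasoning steps" and "canonical record".
    reasoning_steps: list[ReasoningStep] = Field(alias="reasoning steps")
    canonical_record: dict[str, Any] = Field(alias="canonical record")

# With Pydantic v2, ReasoningTrace.model_validate(parsed_json) turns the JSON portion
# of the model output into an inspectable object.
```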
#### Traceable Prompting
@@ -89,7 +154,7 @@ We add `Traceable Prompting` to the prompt to make the model explainable.
### Task 1.7: Model Deployment as a Service
The model is deployed as a service on the Vertex AI platform via GCP Compute Engine. We pick `borbann-pipeline-2` as the final model to deploy according to the evaluation result.
For now, we are not using this model with the pipeline service, so we will demonstrate it manually.
@@ -97,6 +162,15 @@ Anyway, currently we are not using this model with the pipeline service yet, so
### Task 2.1 UI design
This AI component does not show up directly in the UI; it is an automated process that maps pipeline service data to the canonical record.
However, the pipeline service UI below shows the data that we map to the canonical record.
![pipeline-1](assets/ui/pipeline-1.png)
![pipeline-2](assets/ui/pipeline-2.png)
![pipeline-3](assets/ui/pipeline-3.png)
We do not have a UI for collecting user feedback at this time, but we plan to add one in the future.
### Task 2.2: Demonstration
#### UI - Model Interface Design

BIN assets/evaluate-console.png Normal file (75 KiB, binary file not shown)
BIN assets/ui/pipeline-1.png Normal file (74 KiB, binary file not shown)
BIN assets/ui/pipeline-2.png Normal file (132 KiB, binary file not shown)
BIN assets/ui/pipeline-3.png Normal file (110 KiB, binary file not shown)

evaluation_results.json Normal file (12 lines added)
@@ -0,0 +1,12 @@
{
"json_validity_count": {
"projects/83228855505/locations/us-central1/endpoints/7340996035474358272": 11,
"projects/83228855505/locations/us-central1/endpoints/5289606405207097344": 12,
"projects/83228855505/locations/us-central1/endpoints/7800363197466148864": 12
},
"pydantic_validity_count": {
"projects/83228855505/locations/us-central1/endpoints/7340996035474358272": 7,
"projects/83228855505/locations/us-central1/endpoints/5289606405207097344": 0,
"projects/83228855505/locations/us-central1/endpoints/7800363197466148864": 0
}
}

explainability.json Normal file (103 lines added)
@@ -0,0 +1,103 @@
Here's a breakdown of how to generate output in a machine-parsable format suitable for downstream reasoning systems, along with the canonical record generation:
**1. Understanding the Goal: Machine Comprehension & Reasoning**
The overall goal is to provide information that a computer can understand and use to perform tasks like answering questions, generating new knowledge, or making decisions. This requires:
* **Structured Representation:** Data must be organized in a predictable and consistent manner (e.g., JSON, XML, RDF).
* **Explicit Relationships:** Relationships between entities and concepts must be clearly defined (e.g., "agent performs action on object").
* **Reasoning Steps:** Each logical inference or deduction that leads to the final conclusion must be documented.
**2. Steps to Generate Machine-Parsable Output**
Here's a general process, adaptable based on the input type (e.g., text, sensor data):
**a. Input Analysis & Preprocessing:**
* **Understanding the Input:** Identify the type of input (e.g., text, sensor data). Analyze its format and content.
* **Tokenization & Normalization (if applicable):** Break down text into tokens (words, numbers, symbols). Normalize the tokens (e.g., lowercase). Handle punctuation and other special characters.
* **Named Entity Recognition (NER) (if applicable):** Identify and classify entities in the input (e.g., people, organizations, locations, dates). Tag them.
* **Relationship Extraction (if applicable):** Identify relationships between entities (e.g., "John works for IBM"). Tag them.
* **Event Extraction (if applicable):** Identify events and their participants (e.g., "The car crashed"). Tag them.
* **Semantic Role Labeling (SRL) (if applicable):** Label roles of event participants (e.g., agent, patient, instrument, location).
**b. Reasoning & Inference:**
* **Knowledge Base (if applicable):** Use a knowledge base (KB) or external data source to provide background information, domain knowledge, and inference rules.
* **Reasoning Engine:** Employ a reasoning engine (e.g., rule-based inference, logical deduction, probabilistic inference) to derive new knowledge or conclusions based on the input and KB (if used).
* **Trace Generation:** Document each step of the reasoning process. This will form the "reasoning steps" key in the output.
**c. Canonical Record Generation:**
* **Entity Resolution (if applicable):** Resolve multiple mentions of the same entity to a single canonical representation.
* **Canonicalization:** Standardize the representation of entities, relationships, and events. This might involve mapping synonyms to canonical terms, converting units, or applying other normalizations.
* **Record Creation:** Generate a structured representation (e.g., JSON) that captures the canonicalized information. This will form the "canonical record" key in the output.
**3. Output Format: JSON with Reasoning Steps & Canonical Record**
Generate a JSON object with the following keys:
* **reasoning steps:** A list of dictionaries. Each dictionary represents a reasoning step. Each dictionary has keys like:
* **step_number:** The numerical order of the step.
* **description:** A natural language description of the step (e.g., "Applying rule X to deduce Y").
* **input:** The input to the reasoning step (e.g., facts, observations).
* **output:** The output of the reasoning step (e.g., new facts, conclusions).
* **rule:** (Optional) The rule applied in the step.
* **canonical record:** A structured representation of the canonical information. The structure depends on the input type and the task. General considerations:
* **Entities:** Represent entities with unique identifiers (UUIDs recommended). Include attributes like name, type, and other relevant details.
* **Relationships:** Represent relationships between entities using predicates (e.g., "works_for," "located_in"). Include attributes like start date, end date, etc.
* **Events:** Represent events with unique identifiers (UUIDs recommended). Include attributes like event type, participants (linked entities), location, time, and other relevant details.
**4. Example: Text Input - "John works for IBM in New York."**
**a. Input Analysis & Preprocessing:**
* Tokenization: John, works, for, IBM, in, New, York
* NER: John (PERSON), IBM (ORGANIZATION), New York (LOCATION)
* Relationship Extraction: John works for IBM; IBM located in New York.
**b. Reasoning & Inference:**
* Knowledge Base (Hypothetical): Contains general knowledge about employment, organizational locations, etc.
* Reasoning: Can derive the following:
* John's employer is IBM.
* IBM's location is New York.
**c. Canonical Record Generation:**
* Entity Resolution: Create UUIDs for John and IBM.
* Canonicalization: Ensure standardized names (e.g., "John Doe").
* Record Creation:
{
"reasoning steps": [
{"step_number": 1, "description": "Tokenized input.", "input": "John works for IBM in New York.", "output": null, "rule": null},
{"step_number": 2, "description": "Performed Named Entity Recognition (NER).", "input": "John works for IBM in New York.", "output": {"PERSON": "John", "ORGANIZATION": "IBM", "LOCATION": "New York"}, "rule": null},
{"step_number": 3, "description": "Extracted relationship: John works for IBM.", "input": {"PERSON": "John", "ORGANIZATION": "IBM"}, "output": null, "rule": null},
{"step_number": 4, "description": "Extracted relationship: IBM located in New York.", "input": {"ORGANIZATION": "IBM", "LOCATION": "New York"}, "output": null, "rule": null},
{"step_number": 5, "description": "Inferred John's employer is IBM and IBM's location is New York.", "input": null, "output": null, "rule": null}
],
"canonical record": {
"entities": [
{"uuid": "GENERATED-UUID-PERSON", "name": "John Doe", "type": "PERSON"},
{"uuid": "GENERATED-UUID-ORGANIZATION", "name": "IBM", "type": "ORGANIZATION", "location": {"uuid": "GENERATED-UUID-LOCATION", "name": "New York", "type": "LOCATION"}}
],
"relationships": [
{"uuid": "GENERATED-UUID-RELATIONSHIP", "type": "works_for", "entity1": "GENERATED-UUID-PERSON", "entity2": "GENERATED-UUID-ORGANIZATION"}
],
"events": []
}
}
**5. Considerations & Enhancements:**
* **Knowledge Base (KB):** Use a formal KB (e.g., Wikidata, ConceptNet) to improve reasoning capabilities and canonicalization.
* **Disambiguation:** Handle ambiguity in the input (e.g., multiple John Does).
* **Uncertainty:** Represent uncertainty using probabilities or confidence scores.
* **Domain-Specific Knowledge:** Incorporate domain-specific knowledge (e.g., medical knowledge) for more specialized reasoning.
* **Feedback:** Provide feedback to the user about the reasoning process and the final conclusion.
* **Evaluation:** Evaluate the system's performance using appropriate metrics (e.g., accuracy, precision, recall).
* **Scalability:** Design the system to handle large volumes of data.
* **User Interface:** Create a user interface to visualize the reasoning process and the canonical record.
* **Standard Terminologies:** Employ standardized terminologies (e.g., SNOMED CT, UMLS) to enhance interoperability.
* **Automatic Rule Generation:** Explore techniques for automatically generating inference rules from data.