of law

Team APEX

Fine-tuning LLMs to auto-generate training datasets of question-answer pairs from legal documents

AGENDA

Project Network & Overview
Dataset
Model Selection
Model Training & Validation
Model Evaluation & Findings
Next Steps
Q&A

AN OVERVIEW OF TEAM MEMBERS, PARTNER, & PROBLEM SPACE

01

Project Network & Overview

Accure specializes in AI and ML platforms, providing data engineering and professional services. With more than 50 deployments, their solutions automate processes, predict maintenance needs, enhance supply chain visibility, and reduce costs through advanced data analysis and machine learning technologies.

ACCURE CUSTOMERS AND PARTNERS:
Meet our Partner

GPT

securegpt

Visualization

Data Warehousing

Deployment

insent

Impulse

Momentum

Accure offers an array of products and solutions with a proven track record across various industries

Developer

Alex Marcia-Gonzalez

Developer

Henry Wu

Developer

Mitch Breeden

Product Owner

Sushmitha Tamilselvan

Scrum Master

Jacob Baisden

Team Overview

While LLMs are versatile and potent tools, their utility is contingent on the quality of the datasets used for training. Without adequate datasets for the fine-tuning process, LLMs remain generic.

FORBES

"If large language models are able to generate their own training data and use it to continue self-improving, this could render irrelevant the data shortage. It would represent a mind-bending leap forward for LLMs"

LLM dataset creation demands specialized knowledge from domain-specific subject matter experts.

expert knowledge needed

High costs due to expert involvement and extensive labor hours incurred.

financially demanding

Fine-tuning LLMs with manual data labeling is a lengthy, labor-intensive process.

TIME CONSUMING effort

These current fine-tuning practices remain far from ideal.

Traditional methods to create datasets that fine-tune LLMs involve manual data labeling

Minimal human oversight streamlines and simplifies LLM dataset development, and eases time and financial burdens.

minimal oversight

Self-generated datasets facilitate automated, independent fine-tuning of LLMs.

self generated

Quality question-answer pairs for LLMs ensure efficient output.

quality data Pairs

Team APEX ventures to address this challenge.

There is a need for a streamlined approach that improves on the status quo.

AN OVERVIEW OF PROCURED DATA AND ITS USE CASE

02

DATASETS

THE CASELAW ACCESS PROJECT

unique

cases

6.9M

The Caselaw Access Project dates as far back as the early 1800s. For our purposes, however, we focus on more recent cases (2013-2023).

Expansive

Harvard Law School compiled the project and has ownership over it. Despite this, the data remains freely available.

Harvard Owned

Open-source status indicates that the license is free for public use and that the dataset carries no restrictions for researchers.

Open Source

Team APEX collected the data through an API (for large numbers of cases) and accessed individual cases through the website above.

data assessment

Reliable information

Only minor data conditioning was needed, owing to the quality of the provider's data.

data labeling process

01

Feed corpus (court case) into a GPT along with general instructions

02

Highlight the information you wish to create the questions for

03

Receive the Question-Answer/Answer-Question Pair

04

Review, correct, and repeat as needed

100 Question-Answer Pairs

80% used for training

18% used for testing

2% used for evaluation
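
The 80/18/2 split above can be sketched as follows. This is a minimal illustration, not the team's actual code: the function name `split_pairs`, the fixed seed, and the placeholder records are all assumptions.

```python
import random

def split_pairs(pairs, train=0.80, test=0.18, seed=0):
    """Shuffle the pairs and split them into train/test/eval subsets.

    The eval subset receives whatever remains after train and test.
    """
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_test],
        shuffled[n_train + n_test:],
    )

# Placeholder records standing in for the 100 labeled question-answer pairs.
pairs = [{"id": i} for i in range(100)]
train_set, test_set, eval_set = split_pairs(pairs)
print(len(train_set), len(test_set), len(eval_set))
```

With only 100 pairs, the evaluation subset holds two examples, so any metric computed on it alone should be read with caution.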

resulting dataset

[Table columns: item #, field, type, description]

ANALYSIS OF MODEL SELECTION PROCESS

03

MODEL SELECTION

[Parameter-count comparisons shown on the slides: 110M vs 175B; 110M vs 40-180B; 7B-70B vs 40-180B; 7B-70B vs ~1B-11B]

We chose FLAN-T5 Large & XL for training. To keep these models manageable, LoRA is applied to them as well.
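
To illustrate why LoRA keeps fine-tuning manageable, the sketch below applies a low-rank update to a frozen weight matrix and compares parameter counts. It is a pure-Python illustration under stated assumptions: the layer size (1024 x 1024), the rank (8), and the helper names are hypothetical and are not taken from the team's configuration.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def lora_effective_weight(w0, a, b, alpha):
    """Return W0 + (alpha / r) * B @ A, where r is the adapter rank.

    w0 is the frozen pretrained weight (d_out x d_in),
    a has shape (r x d_in) and b has shape (d_out x r).
    """
    r = len(a)
    delta = matmul(b, a)
    scale = alpha / r
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(w0, delta)
    ]

def lora_param_counts(d_out, d_in, r):
    """Return (frozen parameters, trainable LoRA parameters) for one matrix."""
    return d_out * d_in, r * (d_in + d_out)

# Tiny worked example with rank 1.
w0 = [[1.0, 0.0], [0.0, 1.0]]
a = [[1.0, 2.0]]
b = [[0.5], [0.0]]
print(lora_effective_weight(w0, a, b, alpha=2.0))

# Hypothetical 1024 x 1024 projection with rank 8.
frozen, trainable = lora_param_counts(1024, 1024, 8)
print(frozen, trainable)
```

Only the two small matrices are trained while the pretrained weight stays frozen, which is where the savings in trainable parameters come from.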

Working with orders of magnitude

Selected Model

T5 ARCHITECTURE & LoRA

[Diagram: T5 encoder-decoder stack. Inputs pass through positional encoding into Encoder 1 and Encoder 2 (self attention, add & normalize, feed forward, add & normalize), then into Decoder 1 and Decoder 2 (self attention, add & normalize, encoder/decoder attention, add & normalize, feed forward, add & normalize), followed by a linear layer and softmax. LoRA matrices (labelled WA in the diagram) sit alongside the frozen pretrained weights on the Q and V projections.]

Resource Requirements

With the model quantized, we found that a single NVIDIA A100 40GB GPU is sufficient, with room left for overhead.

APPROACHES TO TRAINING AND VALIDATION RESULTS

04

MODEL TRAINING & VALIDATION

Standard LLM Metrics are Unreliable for Question Generation

Pre-determining metrics for evaluation is crucial for meaningful insights. The following example illustrates our need to deviate from BLEU/ROUGE:

Hypothetical Scenario:

Context (From Dataset): The law firm handled the case pro bono to support the community.

Expected Question (From Dataset): Why did the law firm handle the case pro bono?

LLM Generated Question: Why did the law firm handle the case pro bono?

Estimated ROUGE scores:

Rouge-1 estimate is Perfect! 100%

Rouge-2 estimate is Perfect! 100%

Rouge-L estimate is Perfect! 100%

Hypothetical Scenario:

Context (From Dataset): The law firm handled the case pro bono to support the community.

Expected Question (From Dataset): Why did the law firm handle the case pro bono?

LLM Generated Question: How did the law firm support the community?

Estimated ROUGE scores:

Rouge-1 estimate is not better than chance: 50%

Rouge-2 estimate is poor: 25%

Rouge-L estimate is not better than chance: 50%
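
The overlap scores in the hypothetical scenario can be reproduced approximately with a small pure-Python sketch of recall-oriented ROUGE. This is an illustration, not the tooling used for the slides: the tokenizer (lowercased word characters) and the choice to report recall are assumptions, and exact values depend on those choices, so they need not match the slide's estimates digit for digit.

```python
import re
from collections import Counter

def tokens(text):
    """Lowercase the text and keep runs of word characters."""
    return re.findall(r"\w+", text.lower())

def ngrams(seq, n):
    """Count the n-grams in a token sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def rouge_n_recall(candidate, reference, n):
    """Share of reference n-grams that also appear in the candidate."""
    cand, ref = ngrams(tokens(candidate), n), ngrams(tokens(reference), n)
    total = sum(ref.values())
    return sum((cand & ref).values()) / total if total else 0.0

def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by the reference length."""
    cand, ref = tokens(candidate), tokens(reference)
    return lcs_length(cand, ref) / len(ref) if ref else 0.0

expected = "Why did the law firm handle the case pro bono?"
generated = "How did the law firm support the community?"
for name, score in [
    ("ROUGE-1", rouge_n_recall(generated, expected, 1)),
    ("ROUGE-2", rouge_n_recall(generated, expected, 2)),
    ("ROUGE-L", rouge_l_recall(generated, expected)),
]:
    print(name, round(score, 2))
```

Both questions are reasonable for the same context, yet the n-gram overlap is low, which is the weakness the scenario is meant to expose.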

Enter BERTScores

This model-agnostic solution leverages interchangeable contextual embeddings from BERT transformers to focus on the meaning and context of outputs rather than on the exact words used and their order. As a result, we are able to use the following metrics:

loss

Indicator of the LLM's prediction error during training.

question reliability

Model's propensity towards consistent question generation

precision

Ratio of true positives to all positive predictions

recall

Ratio of true positives to all actual positives

f1

Harmonic mean of precision and recall
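
In BERTScore, precision and recall are soft counterparts of the definitions above: each token embedding is greedily matched to its most similar token on the other side by cosine similarity. The sketch below shows that matching step in pure Python. The two-dimensional vectors are hypothetical stand-ins; a real run would use contextual embeddings from a BERT-style model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bertscore(cand_vecs, ref_vecs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Precision: each candidate token is matched to its closest reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: each reference token is matched to its closest candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1

# Hypothetical toy embeddings, one vector per token.
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]
print(bertscore(candidate, reference))
```

Because the comparison happens in embedding space, a paraphrased question can still score highly even when it shares few exact words with the expected question.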


MODEL BASELINES

FLAN-T5-LARGE

VALIDATION LOSS: 1.65

PRECISION: 80%

RECALL: 82%

F1: 81%

QUESTION RELIABILITY: 0%

Context (From Dataset): Input: answer: Mr. Van Dusen was diagnosed with Voyeuristic Disorder and Major Depressive Disorder — Mild by Dr. Jeffrey S. Janofsky. context: During the hearing in this matter, the Commission offered the testimony of psychiatrist Jeffrey S. Janofsky, MD, who was accepted by the hearing judge as an expert. According to Dr. Janofsky, at the time of the surreptitious videotaping, Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild.

Expected Question (From Dataset): What were the psychiatric diagnoses of Mr. Van Dusen during the surreptitious videotaping period?

LLM Generated Question: Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild

MODEL BASELINES

FLAN-T5-XL

VALIDATION LOSS: 1.13

PRECISION: 69%

RECALL: 69%

F1: 69%

QUESTION RELIABILITY: 11%

Context (From Dataset): Input: answer: Mr. Van Dusen was diagnosed with Voyeuristic Disorder and Major Depressive Disorder — Mild by Dr. Jeffrey S. Janofsky. context: During the hearing in this matter, the Commission offered the testimony of psychiatrist Jeffrey S. Janofsky, MD, who was accepted by the hearing judge as an expert. According to Dr. Janofsky, at the time of the surreptitious videotaping, Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild.

Expected Question (From Dataset): What were the psychiatric diagnoses of Mr. Van Dusen during the surreptitious videotaping period?

LLM Generated Question: Mr. Van Dusen was suffering from Voyeuristic Disorder and Major Depressive Disorder — Mild
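
The slides do not state how question reliability was computed. One simple proxy, assumed here purely for illustration, is the share of model outputs that are phrased as questions, detected by a trailing question mark. The function names are hypothetical and the team's actual measure may differ.

```python
def looks_like_question(text):
    """Heuristic (an assumption): an output counts as a question if it ends with '?'."""
    return text.strip().endswith("?")

def question_rate(outputs):
    """Fraction of outputs that pass the question heuristic."""
    if not outputs:
        return 0.0
    return sum(looks_like_question(o) for o in outputs) / len(outputs)

outputs = [
    "The law firm handled the case pro bono to support the community.",
    "Why did the law firm handle the case pro bono?",
]
print(question_rate(outputs))
```

A baseline model that restates the answer instead of asking about it, as in the example above, would score low on a measure of this kind.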

Hyperparameters

Performance Optimizers

Epochs · 10

Warm Up Steps · 05

Learning Rate · Varied

Weight Decay · .01

Eval Strategy · Epochs

Selection Metric · Loss

Gradient Clipping · 01

MODEL VALIDATION

[Plots: validation loss per epoch for FLAN-T5-LARGE and FLAN-T5-XL, comparing the base model with learning rates 1e-5, 2e-4, 3e-4, and 1e-3.]
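
With evaluation run every epoch and validation loss as the selection metric, choosing a checkpoint across the learning-rate sweep reduces to taking a minimum. The sketch below shows that step only. The loss values are placeholders invented for illustration and are not the project's results; the function name is hypothetical.

```python
def best_checkpoint(history):
    """Return (learning_rate, epoch, loss) for the lowest validation loss.

    history maps (learning_rate, epoch) to the validation loss
    recorded at the end of that epoch.
    """
    (lr, epoch), loss = min(history.items(), key=lambda item: item[1])
    return lr, epoch, loss

# Placeholder losses for illustration only; NOT the project's measurements.
history = {
    (1e-5, 1): 1.9, (1e-5, 2): 1.7,
    (3e-4, 1): 1.4, (3e-4, 2): 1.2, (3e-4, 3): 1.3,
}
print(best_checkpoint(history))
```

Selecting on validation loss rather than training loss guards against picking an epoch where the model has started to overfit.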

An overview of the model that generated the most consistent questions

FINAL VERDICT

BEST MODEL FULL METRICS

HUMAN EVALUATION OF PERFORMANCE & FINDINGS

05

MODEL EVALUATION & FINDINGS

MODEL EVALUATION

FLAN-T5-LARGE: EPOCH 3

TRAINING LOSS: 0.8

VALIDATION LOSS: 0.8

PRECISION: 87%

RECALL: 94%

F1: 90%

QUESTION RELIABILITY: 100%

Context (From Dataset): Input: Answer: Trademark infringement/dilution and false advertising Context: This is a trademark infringement/dilution and false advertising case. Plaintiff Apple Inc. (“Apple”) alleges that defendant Amazon.com Inc. (“Amazon”) has been improperly using the term “APP STORE” in connection with sales of apps for Android devices and the Kindle Fire (Amazon’s tablet computer).

Expected Question (From Dataset): What is the primary claim made by Apple Against Amazon?

LLM Generated Question: What type of lawsuit is being filed by Apple vs. Amazon.com?
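
As a consistency check, the F1 values on the metric slides can be recomputed from the reported precision and recall using the harmonic mean. The sketch below does this for the three reported models; the dictionary layout and function name are assumptions made for the example.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as read from the slides.
reported = {
    "FLAN-T5-LARGE baseline": (0.80, 0.82, 0.81),
    "FLAN-T5-XL baseline": (0.69, 0.69, 0.69),
    "FLAN-T5-LARGE epoch 3": (0.87, 0.94, 0.90),
}
for name, (p, r, f1) in reported.items():
    print(name, round(f1_score(p, r), 2), f1)
```

In each case the recomputed value rounds to the reported F1.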

Performance on outside instances (DEMO)

Here are some of ours:

Fools Gold

Small datasets can mislead model bias-variance evaluation

Split Decisions

Model sensitive to split choice

Precision Postponed

Minor impact on small datasets, better for final optimization efforts.

Ignorance (is not) Bliss

Higher rates improve robustness

Training Trifecta

Context, question, answer integration essential in fine-tuning.

Less is More

Larger models need more data, risk overfitting.
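
The "Training Trifecta" finding can be made concrete with the input layout visible in the dataset examples, where the answer and context are joined into one input string and the question is the target. The sketch below is a minimal illustration; the function name and the dictionary keys are assumptions, not the team's code.

```python
def build_example(context, answer, question):
    """Pair an 'answer: ... context: ...' input with its target question."""
    return {
        "input": f"answer: {answer} context: {context}",
        "target": question,
    }

example = build_example(
    context="The law firm handled the case pro bono to support the community.",
    answer="To support the community.",
    question="Why did the law firm handle the case pro bono?",
)
print(example["input"])
print(example["target"])
```

Keeping all three pieces in each training record lets the model learn to ask a question whose answer is the highlighted span, given the surrounding context.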

"The power of data lies not in its volume, but in its interpretation."

tech report

RECOMMENDATIONS FOR IMPROVEMENTS

06

NEXT STEPS

Optimize questions for added complexity and sophistication by increasing the token limit and using more advanced algorithms.

INCREASE COMPLEXITY

Add answering capability to the LLM for practical, end-to-end use.

INCORPORATE ANSWERING

Evolve text-input tool to support various formats, batch processing, and LLM integration.

INCREASE TOOL VERSATILITY

Expand manual dataset for better performance, prioritizing data over increasing model size.

EXPAND THE DATASET

Revise two-step fine-tuning with legal dataset and manual labels, adding intermediate knowledge step.

INTERMEDIARY FINE-TUNING

RECOMMENDATIONS FOR ADDED PROGRESS

OPEN FLOOR FOR DISCUSSION

07

Questions & Answers


Justification

• Original Model Size (without quantization):
  • Each parameter uses 32 bits, or 4 bytes.
  • For 3 billion parameters: 3B parameters × 4 bytes/parameter = 12B bytes.
  • 1 GiB = 2^30 bytes ≈ 1.074B bytes.
  • The original size is 12B bytes / 1.074B bytes/GiB ≈ 11.18 GiB (12 GB in decimal units).
• Reduced Model Size with 4-bit Quantization:
  • 8 bits = 1 byte (by definition), so 4 bits = 0.5 bytes.
  • Quantization reduces each parameter to 4 bits, or 0.5 bytes.
  • For 3 billion parameters: 3B parameters × 0.5 bytes/parameter = 1.5B bytes.
  • The quantized size is 1.5B bytes / 1.074B bytes/GiB ≈ 1.4 GiB (1.5 GB in decimal units).
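
The arithmetic above can be checked with a few lines of Python. This is a minimal sketch covering weight storage only; it ignores activations, optimizer state, and any quantization overhead, and the function name is an assumption.

```python
def model_size_gib(n_params, bits_per_param):
    """Weight storage in GiB for a model with the given parameter width."""
    total_bytes = n_params * bits_per_param / 8
    return total_bytes / 2**30

n_params = 3_000_000_000  # 3 billion parameters, as in the justification
print(round(model_size_gib(n_params, 32), 2))  # full 32-bit precision
print(round(model_size_gib(n_params, 4), 2))   # 4-bit quantization
```

Going from 32 bits to 4 bits per parameter shrinks weight storage by a factor of eight, which is consistent with the slide's claim that a single 40GB GPU leaves room for overhead.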