Beta: Firebase Genkit is in Beta, which means that it is not subject to any SLA or deprecation policy and could change in backwards-incompatible ways. Throughout the Beta period, Firebase Genkit and its documentation will be updated and improved.

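Writing a Genkit Evaluator

Firebase Genkit can be extended to support custom evaluation of test case output, either by using an LLM as a judge or purely programmatically.

Evaluator definition

Evaluators are functions that assess the content given to and generated by an LLM. There are two main approaches to automated evaluation (testing): heuristic assessment and LLM-based assessment. In the heuristic approach, you define a deterministic function, like those of traditional software development. In an LLM-based assessment, the content is fed back to an LLM, and the LLM is asked to score the output according to criteria set in a prompt.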
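As a rough illustration of the two approaches, the sketch below pairs a deterministic heuristic with an LLM-style scorer. It does not use the Genkit API: `SimpleScore`, `Judge`, `maxWordsScore`, and `judgedScore` are hypothetical names, the politeness criterion is invented, and the judge is a stub standing in for a real model call.

```typescript
// Hypothetical shapes for illustration only; these are not Genkit types.
type SimpleScore = { score: boolean | string; reasoning: string };
type Judge = (prompt: string) => Promise<string>;

// Heuristic: a deterministic function. The same input always gives the same score.
function maxWordsScore(output: string, limit: number): SimpleScore {
  const words = output.trim().split(/\s+/).filter((w) => w.length > 0);
  return {
    score: words.length <= limit,
    reasoning: `Output has ${words.length} word(s); limit is ${limit}.`,
  };
}

// LLM-based: the output is fed back to a judge model together with the criteria.
async function judgedScore(output: string, judge: Judge): Promise<SimpleScore> {
  const prompt = `Does the following text sound polite? Answer "yes" or "no".\n${output}`;
  const verdict = (await judge(prompt)).trim().toLowerCase();
  return { score: verdict, reasoning: 'Verdict returned by the judge model.' };
}

// A stub judge so the sketch runs without a model; a real judge would call an LLM.
const stubJudge: Judge = async () => 'yes';
```

The heuristic can be unit tested like any other function, while the quality of the LLM-based scorer depends on the judge model and on how the criteria are phrased.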

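LLM-based Evaluators

An LLM-based evaluator uses an LLM to evaluate the input, context, or output of your generative AI feature.

LLM-based evaluators in Genkit are made up of 3 components:

A prompt
A scoring function
An evaluator action

Define the prompt

For this example, the prompt asks the LLM to judge how delicious the output is. First, provide context to the LLM, then describe what you want it to do, and finally, give it a few examples to base its response on.

Genkit's definePrompt utility provides an easy way to define prompts with input and output validation. Here is how you can set up an evaluation prompt with definePrompt.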

import { z } from 'genkit';

// `ai` below is the Genkit instance returned by genkit({ ... }).
const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;

const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(DELICIOUSNESS_VALUES),
});
type DeliciousnessDetectionResponse = z.infer<typeof DeliciousnessDetectionResponseSchema>;

const DELICIOUSNESS_PROMPT = ai.definePrompt(
  {
    name: 'deliciousnessPrompt',
    inputSchema: z.object({
      output: z.string(),
    }),
    outputSchema: DeliciousnessDetectionResponseSchema,
  },
  `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

  Examples:
  Output: Chicken parm sandwich
  Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

  Output: Boston Logan Airport tarmac
  Response: { "reason": "Not edible.", "verdict": "no" }

  Output: A juicy piece of gossip
  Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

  New Output:
  {{output}}
  Response:
  `
);
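The output schema above constrains the judge's reply to a reason plus one of three verdicts. To show what that validation amounts to, here is a dependency-free sketch that checks a raw JSON reply by hand. `parseVerdict` and its types are hypothetical helpers, not part of Genkit; in the example above, the Zod schema is what describes the expected shape.

```typescript
const ALLOWED_VERDICTS = ['yes', 'no', 'maybe'] as const;
type Verdict = (typeof ALLOWED_VERDICTS)[number];
type JudgeReply = { reason: string; verdict: Verdict };

// Hypothetical helper: returns a typed reply, or null when the text is not valid.
function parseVerdict(raw: string): JudgeReply | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    return null; // the reply is not JSON at all
  }
  if (typeof parsed !== 'object' || parsed === null) return null;
  const { reason, verdict } = parsed as Record<string, unknown>;
  if (typeof reason !== 'string' || typeof verdict !== 'string') return null;
  // Accept the verdict only if it is one of the allowed values.
  const match = ALLOWED_VERDICTS.find((v) => v === verdict);
  return match ? { reason, verdict: match } : null;
}
```

Rejecting anything outside the allowed set keeps a free-form reply such as "tasty" from being recorded as a score.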

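Define the scoring function

Now, define the function that takes an example which includes output, as the prompt requires, and scores the result. Genkit test cases include input as a required field, with optional fields for output and context. It is the responsibility of the evaluator to validate that all fields required for evaluation are present.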

import { z } from 'genkit';
import { ModelArgument } from 'genkit/model';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
 * Score an individual test case for deliciousness.
 */
export async function deliciousnessScore<
  CustomModelOptions extends z.ZodTypeAny,
>(
  judgeLlm: ModelArgument<CustomModelOptions>,
  dataPoint: BaseEvalDataPoint,
  judgeConfig?: CustomModelOptions
): Promise<Score> {
  const d = dataPoint;
  // Validate the input has required fields
  if (!d.output) {
    throw new Error('Output is required for Deliciousness detection');
  }

  // Hydrate the prompt
  const finalPrompt = DELICIOUSNESS_PROMPT.renderText({
    output: d.output as string,
  });

  // Call the LLM to generate an evaluation result.
  // `ai` is the Genkit instance that was used to define DELICIOUSNESS_PROMPT.
  const response = await ai.generate({
    model: judgeLlm,
    prompt: finalPrompt,
    config: judgeConfig,
  });

  // Parse the output
  const parsedResponse = response.output;
  if (!parsedResponse) {
    throw new Error(`Unable to parse evaluator response: ${response.text}`);
  }

  // Return a scored response
  return {
    score: parsedResponse.verdict,
    details: { reasoning: parsedResponse.reason },
  };
}
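The scoring function above checks for output before doing anything else. When several evaluators each need a different subset of fields, that check can be factored out. The sketch below assumes a simplified data point shape (`DataPointLike`) and two hypothetical helpers, `missingFields` and `assertFields`; none of these come from Genkit. Unlike the `!d.output` check above, this version treats an empty string as present.

```typescript
// Simplified stand-in for a test case: input is required, the rest is optional.
type DataPointLike = {
  testCaseId?: string;
  input: unknown;
  output?: unknown;
  context?: unknown[];
};

type OptionalField = 'output' | 'context';

// Returns the names of required fields that are absent (undefined or null).
function missingFields(d: DataPointLike, required: OptionalField[]): OptionalField[] {
  return required.filter((field) => d[field] === undefined || d[field] === null);
}

// Throws a descriptive error when any required field is absent.
function assertFields(d: DataPointLike, required: OptionalField[]): void {
  const missing = missingFields(d, required);
  if (missing.length > 0) {
    throw new Error(`Missing required field(s): ${missing.join(', ')}`);
  }
}
```

An evaluator that needs both output and context could then start with `assertFields(dataPoint, ['output', 'context'])`.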

Define the evaluator action

The final step is to write a function that defines the evaluator action itself.

import { Genkit, z } from 'genkit';
import { ModelReference } from 'genkit/model';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Create the Deliciousness evaluator action.
 */
export function createDeliciousnessEvaluator<
  ModelCustomOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judge: ModelReference<ModelCustomOptions>,
  judgeConfig?: z.infer<ModelCustomOptions>
): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myAwesomeEval/deliciousness`,
      displayName: 'Deliciousness',
      definition: 'Determines if output is considered delicious.',
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await deliciousnessScore(judge, datapoint, judgeConfig);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}
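The callback passed to the evaluator pairs each score with the testCaseId of the data point it came from. The sketch below isolates that wrapping step using local stand-in types, so it runs without Genkit. `toEvalCallback`, `nonEmptyScore`, and the `...Like` types are hypothetical names chosen for illustration.

```typescript
// Local stand-ins for the score, data point, and response shapes.
type ScoreLike = { score?: number | string | boolean; details?: { reasoning?: string } };
type PointLike = { testCaseId: string; input: unknown; output?: unknown };
type EvalResponseLike = { testCaseId: string; evaluation: ScoreLike };

// Wraps any scoring function into a per-data-point callback.
function toEvalCallback(
  scoreFn: (d: PointLike) => Promise<ScoreLike>
): (d: PointLike) => Promise<EvalResponseLike> {
  return async (d) => ({
    testCaseId: d.testCaseId,
    evaluation: await scoreFn(d),
  });
}

// Example scorer: passes when the output is a non-empty string.
const nonEmptyScore = async (d: PointLike): Promise<ScoreLike> => ({
  score: typeof d.output === 'string' && d.output.length > 0,
});

const nonEmptyCallback = toEvalCallback(nonEmptyScore);
```

Keeping the scoring function separate from the wrapper means the scorer can be tested on its own, without registering an evaluator.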

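Heuristic Evaluators

A heuristic evaluator can be any function used to evaluate the input, context, or output of your generative AI feature.

Heuristic evaluators in Genkit are made up of 2 components:

A scoring function
An evaluator action

Define the scoring function

Just like the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need to know about the judge LLM or its config.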

import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
  /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/i;

/**
 * Scores whether an individual datapoint matches a US Phone Regex.
 */
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const d = dataPoint;
  if (!d.output || typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = US_PHONE_REGEX.test(d.output as string);
  const reasoning = matches
    ? `Output matched regex ${US_PHONE_REGEX.source}`
    : `Output did not match regex ${US_PHONE_REGEX.source}`;
  return {
    score: matches,
    details: { reasoning },
  };
}
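To see what the pattern accepts, the sketch below runs it against a few sample strings. The samples are made up for illustration. The pattern is anchored at both ends and has no slot for a country code, so a number written with a leading "+1 " or surrounded by other text does not match.

```typescript
// Same pattern as US_PHONE_REGEX above, repeated so the sketch is self-contained.
const PHONE_PATTERN =
  /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/i;

const phoneSamples = [
  '(555) 123-4567',
  '555.123.4567',
  '5551234567',
  '+1 555 123 4567',
  'Call 555-123-4567',
  '12345',
];

// Map each sample to whether it matches the pattern.
const phoneResults = phoneSamples.map((sample) => ({
  sample,
  matches: PHONE_PATTERN.test(sample),
}));

for (const r of phoneResults) {
  console.log(`${r.sample} -> ${r.matches}`);
}
```

Running a handful of known-good and known-bad strings like this is a quick way to confirm a heuristic behaves as intended before wiring it into an evaluator.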

Define the evaluator action

import { Genkit } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Configures a regex evaluator to match a US phone number.
 */
export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myAwesomeEval/us_phone_regex_match`,
      displayName: 'Regex Match',
      definition:
        'Runs the output against a US phone number regex and responds with true if a match is found and false otherwise.',
      isBilled: false,
    },
    async (datapoint: BaseEvalDataPoint) => {
      // Reuse the scoring function defined above
      const score = await usPhoneRegexScore(datapoint);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

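Configuration

Plugin Options

Define the PluginOptions that the custom evaluator plugin will use. This object has no strict requirements and depends on the types of evaluators that are defined.

At a minimum, it needs to specify which metrics to register.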

export enum MyAwesomeMetric {
  WORD_COUNT = 'WORD_COUNT',
  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
}

export interface PluginOptions {
  metrics?: Array<MyAwesomeMetric>;
}

If this new plugin uses an LLM as a judge and the plugin supports swapping out which LLM to use, define additional parameters in the PluginOptions object.

export enum MyAwesomeMetric {
  DELICIOUSNESS = 'DELICIOUSNESS',
  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
}

export interface PluginOptions<ModelCustomOptions extends z.ZodTypeAny> {
  judge: ModelReference<ModelCustomOptions>;
  judgeConfig?: z.infer<ModelCustomOptions>;
  metrics?: Array<MyAwesomeMetric>;
}

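Plugin definition

Plugins are registered with the framework via the genkit.config.ts file in a project. To be able to configure a new plugin, define a function that defines a GenkitPlugin and configures it with the PluginOptions defined above.

In this case we have two evaluators, DELICIOUSNESS and US_PHONE_REGEX_MATCH. This is where those evaluators are registered with the plugin and with Firebase Genkit.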

import { Genkit, z } from 'genkit';
import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';

export function myAwesomeEval<ModelCustomOptions extends z.ZodTypeAny>(
  options: PluginOptions<ModelCustomOptions>
): GenkitPlugin {
  // Define the new plugin
  return genkitPlugin('myAwesomeEval', async (ai: Genkit) => {
    const { judge, judgeConfig, metrics } = options;
    // `metrics` is optional, so fall back to an empty list
    for (const metric of metrics ?? []) {
      switch (metric) {
        case MyAwesomeMetric.DELICIOUSNESS:
          // This evaluator requires an LLM as judge
          createDeliciousnessEvaluator(ai, judge, judgeConfig);
          break;
        case MyAwesomeMetric.US_PHONE_REGEX_MATCH:
          // This evaluator does not require an LLM
          createUSPhoneRegexEvaluator(ai);
          break;
      }
    }
  });
}
export default myAwesomeEval;
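One way to keep the metric-to-evaluator mapping exhaustive is a lookup table keyed by the metric enum, so that adding a metric without a matching factory fails to compile. The sketch below is framework-free and only illustrates the dispatch: `Metric`, `EvaluatorStub`, `FACTORIES`, and `buildEvaluators` are hypothetical names, and the stubs stand in for real evaluator actions.

```typescript
enum Metric {
  DELICIOUSNESS = 'DELICIOUSNESS',
  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
}

// Stand-in for an evaluator action; it only records a name and whether a judge is needed.
type EvaluatorStub = { name: string; usesJudge: boolean };

// Record<Metric, ...> forces one factory per enum member at compile time.
const FACTORIES: Record<Metric, () => EvaluatorStub> = {
  [Metric.DELICIOUSNESS]: () => ({
    name: 'myAwesomeEval/deliciousness',
    usesJudge: true,
  }),
  [Metric.US_PHONE_REGEX_MATCH]: () => ({
    name: 'myAwesomeEval/us_phone_regex_match',
    usesJudge: false,
  }),
};

// Builds evaluators for the requested metrics, skipping duplicates.
function buildEvaluators(metrics: Metric[] = []): EvaluatorStub[] {
  return Array.from(new Set(metrics)).map((metric) => FACTORIES[metric]());
}
```

Compared with a switch statement, the table makes a missing case a type error instead of a silently skipped metric.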

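Configure Genkit

Add the newly defined plugin to your Genkit configuration.

For evaluation with Gemini, disable safety settings so that the evaluator can accept, detect, and score potentially harmful content.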

import { genkit } from 'genkit';
import { gemini15Flash } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [
    ...
    myAwesomeEval({
      judge: gemini15Flash,
      judgeConfig: {
        safetySettings: [
          {
            category: 'HARM_CATEGORY_HATE_SPEECH',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_HARASSMENT',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
            threshold: 'BLOCK_NONE',
          },
        ],
      },
      metrics: [
        MyAwesomeMetric.DELICIOUSNESS,
        MyAwesomeMetric.US_PHONE_REGEX_MATCH
      ],
    }),
  ],
  ...
});
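The four safety settings above differ only in their category, so they can be generated from a list. This is a small sketch with hypothetical names (`HARM_CATEGORIES`, `blockNoneFor`, `judgeConfigSketch`); the category and threshold strings are the ones used in the configuration above.

```typescript
const HARM_CATEGORIES = [
  'HARM_CATEGORY_HATE_SPEECH',
  'HARM_CATEGORY_DANGEROUS_CONTENT',
  'HARM_CATEGORY_HARASSMENT',
  'HARM_CATEGORY_SEXUALLY_EXPLICIT',
] as const;

type HarmCategory = (typeof HARM_CATEGORIES)[number];
type SafetySetting = { category: HarmCategory; threshold: 'BLOCK_NONE' };

// Builds one BLOCK_NONE setting per category.
function blockNoneFor(categories: readonly HarmCategory[]): SafetySetting[] {
  return categories.map(
    (category): SafetySetting => ({ category, threshold: 'BLOCK_NONE' })
  );
}

// Same shape as the judgeConfig object in the configuration above.
const judgeConfigSketch = { safetySettings: blockNoneFor(HARM_CATEGORIES) };
```

Listing the categories once makes it harder to leave one at its default threshold by accident.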

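Testing

The same issues that apply to evaluating the quality of the output of a generative AI feature also apply to evaluating the judging capacity of an LLM-based evaluator.

To get a sense of whether the custom evaluator performs at the expected level, create a set of test cases that have a clear right and wrong answer.

As an example for deliciousness, that might look like a JSON file named deliciousness_dataset.json: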

[
  {
    "testCaseId": "delicious_mango",
    "input": "What is a super delicious fruit",
    "output": "A perfectly ripe mango – sweet, juicy, and with a hint of tropical sunshine."
  },
  {
    "testCaseId": "disgusting_soggy_cereal",
    "input": "What is something that is tasty when fresh but less tasty after some time?",
    "output": "Stale, flavorless cereal that's been sitting in the box too long."
  }
]

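These examples can be human generated, or you can ask an LLM to help create a set of test cases that can then be curated. There are also many available benchmark datasets that can be used.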
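Once each test case has a clear expected answer, a simple way to gauge the evaluator is the fraction of cases where its verdict agrees with the expected one. The sketch below assumes a hypothetical `expected` label per test case and a map of verdicts already collected from an evaluation run. Neither is part of the dataset format shown above, and the IDs and verdict values are invented.

```typescript
type LabeledCase = { testCaseId: string; expected: 'yes' | 'no' | 'maybe' };

// Compares observed verdicts with expected labels and lists the mismatches.
function agreementRate(cases: LabeledCase[], verdicts: Map<string, string>) {
  const disagreements = cases
    .filter((c) => verdicts.get(c.testCaseId) !== c.expected)
    .map((c) => c.testCaseId);
  const rate =
    cases.length === 0 ? 0 : (cases.length - disagreements.length) / cases.length;
  return { rate, disagreements };
}

const labeledCases: LabeledCase[] = [
  { testCaseId: 'mango', expected: 'yes' },
  { testCaseId: 'soggy_cereal', expected: 'no' },
];

// Verdicts as they might come back from a run; these values are invented.
const observedVerdicts = new Map([
  ['mango', 'yes'],
  ['soggy_cereal', 'maybe'],
]);

const agreementSummary = agreementRate(labeledCases, observedVerdicts);
```

The list of disagreeing test cases is often more useful than the rate itself, since it points at the examples where the prompt or the expected label needs another look.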

Then use the Genkit CLI to run the evaluator against these test cases.

genkit eval:run deliciousness_dataset.json

View your results in the Genkit UI.

genkit start

Navigate to localhost:4000/evaluate.