Beta: Firebase Genkit is in Beta, which means that it is not subject to any SLA or deprecation policy and could change in backwards-incompatible ways. Throughout the Beta period, Firebase Genkit and its documentation will be updated and improved.

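Writing a Genkit Evaluator

Firebase Genkit can be extended to support custom evaluation of test case output, either by using an LLM as a judge or purely programmatically.

Evaluator Definition

Evaluators are functions that assess the content given to and generated by an LLM. There are two main approaches to automated evaluation (testing): heuristic assessment and LLM-based assessment. In the heuristic approach, you define a deterministic function, like those of traditional software development. In an LLM-based assessment, the content is fed back to an LLM, and the LLM is asked to score the output according to criteria set in a prompt.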
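
The two approaches can be contrasted in a minimal sketch. Everything in it is hypothetical and independent of the Genkit API: the `Judge` type stands in for a call to an LLM, and `fakeJudge` is a canned stub that exists only so the example runs without a model.

```typescript
// Hypothetical types for illustration; these are not Genkit types.
type Verdict = { score: boolean; reasoning: string };
type Judge = (prompt: string) => Promise<string>;

// Heuristic evaluation: a deterministic function of the output.
function endsWithPeriod(output: string): Verdict {
  const score = output.trim().endsWith('.');
  return {
    score,
    reasoning: score ? 'Ends with a period.' : 'No trailing period.',
  };
}

// LLM-based evaluation: the output is sent back to a judge together with the
// criteria, and the judge's reply is turned into a score.
async function isPolite(output: string, judge: Judge): Promise<Verdict> {
  const reply = await judge(
    `Answer "yes" or "no": is the following text polite?\n${output}`
  );
  return { score: reply.trim().toLowerCase() === 'yes', reasoning: reply };
}

// Stub judge so the sketch runs offline; a real judge would call a model.
const fakeJudge: Judge = async (prompt) =>
  prompt.includes('please') ? 'yes' : 'no';
```

The heuristic function returns the same result on every run, while the result of the LLM-based one depends entirely on what the judge replies.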

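LLM-based Evaluators

An LLM-based evaluator uses an LLM to evaluate the input, context, or output of your generative AI feature.

LLM-based evaluators in Genkit are made up of 3 components:

A prompt
A scoring function
An evaluator action

Define the prompt

In this example, the prompt asks the LLM to judge how delicious the output is. First, provide context to the LLM, then describe what you want it to do, and finally give it a few examples to base its response on.

Genkit's definePrompt utility provides an easy way to define prompts with input and output validation. Here is how to set up an evaluation prompt with definePrompt.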

const DELICIOUSNESS_VALUES = ['yes', 'no', 'maybe'] as const;

const DeliciousnessDetectionResponseSchema = z.object({
  reason: z.string(),
  verdict: z.enum(DELICIOUSNESS_VALUES),
});
type DeliciousnessDetectionResponse = z.infer<typeof DeliciousnessDetectionResponseSchema>;

const DELICIOUSNESS_PROMPT = ai.definePrompt(
  {
    name: 'deliciousnessPrompt',
    inputSchema: z.object({
      output: z.string(),
    }),
    outputSchema: DeliciousnessDetectionResponseSchema,
  },
  `You are a food critic. Assess whether the provided output sounds delicious, giving only "yes" (delicious), "no" (not delicious), or "maybe" (undecided) as the verdict.

  Examples:
  Output: Chicken parm sandwich
  Response: { "reason": "A classic and beloved dish.", "verdict": "yes" }

  Output: Boston Logan Airport tarmac
  Response: { "reason": "Not edible.", "verdict": "no" }

  Output: A juicy piece of gossip
  Response: { "reason": "Metaphorically 'tasty' but not food.", "verdict": "maybe" }

  New Output:
  {{output}}
  Response:
  `
);
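
The prompt above follows a three-part structure: context, instruction, then examples. As a rough sketch, the same layout can be assembled programmatically when the examples live in data rather than in a template string. This is plain TypeScript, not part of Genkit; `FewShotExample` and `buildJudgePrompt` are hypothetical names.

```typescript
// Hypothetical shape of one few-shot example.
interface FewShotExample {
  output: string;
  response: { reason: string; verdict: 'yes' | 'no' | 'maybe' };
}

// Assemble context, instruction, examples, and the new output into one prompt.
function buildJudgePrompt(
  context: string,
  instruction: string,
  examples: FewShotExample[],
  newOutput: string
): string {
  const shots = examples
    .map((e) => `Output: ${e.output}\nResponse: ${JSON.stringify(e.response)}`)
    .join('\n\n');
  return `${context} ${instruction}\n\nExamples:\n${shots}\n\nNew Output:\n${newOutput}\nResponse:\n`;
}

const prompt = buildJudgePrompt(
  'You are a food critic.',
  'Assess whether the provided output sounds delicious.',
  [
    {
      output: 'Boston Logan Airport tarmac',
      response: { reason: 'Not edible.', verdict: 'no' },
    },
  ],
  'Chicken parm sandwich'
);
```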

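Define the scoring function

Now, define the function that takes an example that includes output, as required by the prompt, and scores the result. Genkit test cases include input as a required field, with optional fields for output and context. It is the responsibility of the evaluator to validate that all fields required for evaluation are present.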
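
As a sketch of that validation step, the helper below checks that a test case carries a non-empty string output before it is scored. The `EvalDataPoint` interface is a simplified stand-in written for this example (it mirrors the input, output, and context fields described above); it is not the `BaseEvalDataPoint` type exported by Genkit, and `requireStringOutput` is a hypothetical name.

```typescript
// Simplified stand-in for a test case; not Genkit's BaseEvalDataPoint.
interface EvalDataPoint {
  testCaseId?: string;
  input: unknown; // required
  output?: unknown; // optional
  context?: unknown[]; // optional
}

// Returns the output as a string, or throws an error naming the test case.
function requireStringOutput(
  dataPoint: EvalDataPoint,
  evaluatorName: string
): string {
  if (typeof dataPoint.output !== 'string' || dataPoint.output.length === 0) {
    const id = dataPoint.testCaseId ?? '(no id)';
    throw new Error(`${evaluatorName}: test case ${id} has no string output`);
  }
  return dataPoint.output;
}
```

Failing early with the evaluator name and test case ID makes it easier to tell a malformed dataset apart from a judge that scored poorly.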

import { ModelArgument, z } from 'genkit';
import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

/**
 * Score an individual test case for deliciousness.
 */
export async function deliciousnessScore<
  CustomModelOptions extends z.ZodTypeAny,
>(
  judgeLlm: ModelArgument<CustomModelOptions>,
  dataPoint: BaseEvalDataPoint,
  judgeConfig?: CustomModelOptions
): Promise<Score> {
  const d = dataPoint;
  // Validate the input has required fields
  if (!d.output) {
    throw new Error('Output is required for Deliciousness detection');
  }

  // Hydrate the prompt
  const finalPrompt = DELICIOUSNESS_PROMPT.renderText({
    output: d.output as string,
  });

  // Call the LLM to generate an evaluation result.
  // `ai` is the same Genkit instance used to define DELICIOUSNESS_PROMPT.
  const response = await ai.generate({
    model: judgeLlm,
    prompt: finalPrompt,
    config: judgeConfig,
  });

  // Parse the output
  const parsedResponse = response.output;
  if (!parsedResponse) {
    throw new Error(`Unable to parse evaluator response: ${response.text}`);
  }

  // Return a scored response
  return {
    score: parsedResponse.verdict,
    details: { reasoning: parsedResponse.reason },
  };
}
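
The scoring function above relies on the response already being parsed into the expected shape. Where a judge's reply arrives as raw text instead, a defensive parse along the following lines could be used before building the score. This is a sketch in plain TypeScript; `parseJudgeReply` and the `JudgeReply` type are hypothetical names and not part of Genkit.

```typescript
const VERDICTS = ['yes', 'no', 'maybe'] as const;
type VerdictValue = (typeof VERDICTS)[number];

interface JudgeReply {
  reason: string;
  verdict: VerdictValue;
}

// Parse a raw reply and reject anything that is not { reason, verdict }.
function parseJudgeReply(raw: string): JudgeReply {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch (err) {
    throw new Error(`Judge reply is not valid JSON: ${raw}`);
  }
  const candidate = parsed as Partial<JudgeReply> | null;
  if (
    !candidate ||
    typeof candidate.reason !== 'string' ||
    !VERDICTS.includes(candidate.verdict as VerdictValue)
  ) {
    throw new Error(`Judge reply has an unexpected shape: ${raw}`);
  }
  return {
    reason: candidate.reason,
    verdict: candidate.verdict as VerdictValue,
  };
}
```

Rejecting verdicts outside the allowed set keeps a drifting judge from silently introducing new score values into the results.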

Define the evaluator action

The final step is to write a function that defines the evaluator action itself.

import { Genkit, ModelReference, z } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Create the Deliciousness evaluator action.
 */
export function createDeliciousnessEvaluator<
  ModelCustomOptions extends z.ZodTypeAny,
>(
  ai: Genkit,
  judge: ModelReference<ModelCustomOptions>,
  judgeConfig: z.infer<ModelCustomOptions>
): EvaluatorAction {
  // Register the evaluator on the Genkit instance passed in by the plugin.
  return ai.defineEvaluator(
    {
      name: `myAwesomeEval/deliciousness`,
      displayName: 'Deliciousness',
      definition: 'Determines if output is considered delicious.',
    },
    async (datapoint: BaseEvalDataPoint) => {
      const score = await deliciousnessScore(judge, datapoint, judgeConfig);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

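Heuristic Evaluators

A heuristic evaluator can be any function used to evaluate the input, context, or output of your generative AI feature.

Heuristic evaluators in Genkit are made up of 2 components:

A scoring function
An evaluator action

Define the scoring function

As with the LLM-based evaluator, define the scoring function. In this case, the scoring function does not need to know about the judge LLM or its config.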

import { BaseEvalDataPoint, Score } from 'genkit/evaluator';

const US_PHONE_REGEX =
  /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/i;

/**
 * Scores whether an individual datapoint matches a US Phone Regex.
 */
export async function usPhoneRegexScore(
  dataPoint: BaseEvalDataPoint
): Promise<Score> {
  const d = dataPoint;
  if (!d.output || typeof d.output !== 'string') {
    throw new Error('String output is required for regex matching');
  }
  const matches = US_PHONE_REGEX.test(d.output as string);
  const reasoning = matches
    ? `Output matched regex ${US_PHONE_REGEX.source}`
    : `Output did not match regex ${US_PHONE_REGEX.source}`;
  return {
    score: matches,
    details: { reasoning },
  };
}
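
To get a feel for what this pattern accepts, the sketch below runs it against a few sample strings. The pattern is copied from US_PHONE_REGEX above; the sample inputs are made up for illustration.

```typescript
// Same pattern as US_PHONE_REGEX above (the case-insensitive flag is omitted
// because the pattern contains no letters).
const PHONE = /^[\+]?[(]?[0-9]{3}[)]?[-\s\.]?[0-9]{3}[-\s\.]?[0-9]{4}$/;

// Each pair is [candidate output, whether the pattern is expected to match].
const samples: Array<[string, boolean]> = [
  ['(555) 123-4567', true],
  ['555.123.4567', true],
  ['5551234567', true],
  ['+1 555 123 4567', false], // a country code followed by a space is not covered
  ['Call 555-123-4567', false], // anchored: surrounding text fails the match
];

// Collect any sample whose actual result differs from the expectation.
const mismatches = samples.filter(
  ([text, expected]) => PHONE.test(text) !== expected
);
```

Because the pattern is anchored at both ends, the evaluator only passes when the entire output is a phone number, not when a number merely appears somewhere inside a longer answer.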

Define the evaluator action

import { Genkit } from 'genkit';
import { BaseEvalDataPoint, EvaluatorAction } from 'genkit/evaluator';

/**
 * Configures a regex evaluator to match a US phone number.
 */
export function createUSPhoneRegexEvaluator(ai: Genkit): EvaluatorAction {
  return ai.defineEvaluator(
    {
      name: `myAwesomeEval/us_phone_regex_match`,
      displayName: 'Regex Match',
      definition:
        'Runs the output against a US phone number regex and reports whether it matches.',
      isBilled: false,
    },
    async (datapoint: BaseEvalDataPoint) => {
      // Reuse the scoring function defined above.
      const score = await usPhoneRegexScore(datapoint);
      return {
        testCaseId: datapoint.testCaseId,
        evaluation: score,
      };
    }
  );
}

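Configuration

Plugin Options

Define the PluginOptions that the custom evaluator plugin will use. This object has no strict requirements and depends on the types of evaluators that are defined.

At a minimum, it needs to specify which metrics to register.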

export enum MyAwesomeMetric {
  WORD_COUNT = 'WORD_COUNT',
  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
}

export interface PluginOptions {
  metrics?: Array<MyAwesomeMetric>;
}
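
The WORD_COUNT metric listed above is not defined elsewhere on this page. A heuristic scoring function for it might look like the following sketch. The `DataPointLike` and `ScoreLike` interfaces are local stand-ins so that the example runs without Genkit installed, and `wordCountScore` is a hypothetical name.

```typescript
// Local stand-ins for the data point and score shapes; not Genkit types.
interface DataPointLike {
  output?: unknown;
}
interface ScoreLike {
  score: number;
  details: { reasoning: string };
}

// Counts whitespace-separated words in the output.
function wordCountScore(dataPoint: DataPointLike): ScoreLike {
  if (typeof dataPoint.output !== 'string') {
    throw new Error('String output is required for word counting');
  }
  const words = dataPoint.output
    .trim()
    .split(/\s+/)
    .filter((w) => w.length > 0); // an empty output yields zero words
  return {
    score: words.length,
    details: { reasoning: `Output contains ${words.length} word(s)` },
  };
}
```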

If this new plugin uses an LLM as a judge and supports swapping out which LLM to use, define additional parameters in the PluginOptions object.

export enum MyAwesomeMetric {
  DELICIOUSNESS = 'DELICIOUSNESS',
  US_PHONE_REGEX_MATCH = 'US_PHONE_REGEX_MATCH',
}

export interface PluginOptions<ModelCustomOptions extends z.ZodTypeAny> {
  judge: ModelReference<ModelCustomOptions>;
  judgeConfig?: z.infer<ModelCustomOptions>;
  metrics?: Array<MyAwesomeMetric>;
}

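Plugin definition

Plugins are registered with the framework through a project's Genkit configuration. To make a new plugin configurable, define a function that returns a GenkitPlugin and configures it with the PluginOptions defined above.

In this case there are two evaluators, DELICIOUSNESS and US_PHONE_REGEX_MATCH. This is where those evaluators are registered with the plugin and with Firebase Genkit.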

import { Genkit, z } from 'genkit';
import { GenkitPlugin, genkitPlugin } from 'genkit/plugin';

export function myAwesomeEval<ModelCustomOptions extends z.ZodTypeAny>(
  options: PluginOptions<ModelCustomOptions>
): GenkitPlugin {
  // Define the new plugin; each evaluator is registered on the Genkit
  // instance that the framework passes to this callback.
  return genkitPlugin('myAwesomeEval', async (ai: Genkit) => {
    // Register nothing if no metrics were requested.
    const { judge, judgeConfig, metrics = [] } = options;
    for (const metric of metrics) {
      switch (metric) {
        case MyAwesomeMetric.DELICIOUSNESS:
          // This evaluator requires an LLM as judge
          createDeliciousnessEvaluator(ai, judge, judgeConfig);
          break;
        case MyAwesomeMetric.US_PHONE_REGEX_MATCH:
          // This evaluator does not require an LLM
          createUSPhoneRegexEvaluator(ai);
          break;
      }
    }
  });
}
export default myAwesomeEval;
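
The switch in the plugin has no default branch, so a metric added later would not get an evaluator and nothing would flag the omission. One way to catch this at compile time is an exhaustiveness check, sketched below. The `Metric` object is a standalone stand-in for MyAwesomeMetric, and the function returns an illustrative label instead of a real evaluator so that the example runs without Genkit.

```typescript
// Stand-in for the MyAwesomeMetric enum, written as a const object.
const Metric = {
  DELICIOUSNESS: 'DELICIOUSNESS',
  US_PHONE_REGEX_MATCH: 'US_PHONE_REGEX_MATCH',
} as const;
type Metric = (typeof Metric)[keyof typeof Metric];

// Fails to compile if a Metric member is not handled by the switch below,
// and throws at runtime if an unknown value slips through anyway.
function assertNever(value: never): never {
  throw new Error(`Unhandled metric: ${String(value)}`);
}

// Returns an illustrative label in place of a real evaluator action.
function evaluatorLabelFor(metric: Metric): string {
  switch (metric) {
    case Metric.DELICIOUSNESS:
      return 'LLM-judged deliciousness evaluator';
    case Metric.US_PHONE_REGEX_MATCH:
      return 'heuristic US phone regex evaluator';
    default:
      return assertNever(metric);
  }
}
```

With this pattern, adding a new member to `Metric` without a matching case turns into a type error at the `assertNever` call.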

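Configure Genkit

Add the newly defined plugin to your Genkit configuration.

For evaluation with Gemini, disable the safety settings so that the evaluator can accept, detect, and score potentially harmful content.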

import { gemini15Flash } from '@genkit-ai/googleai';

const ai = genkit({
  plugins: [
    ...
    myAwesomeEval({
      judge: gemini15Flash,
      judgeConfig: {
        safetySettings: [
          {
            category: 'HARM_CATEGORY_HATE_SPEECH',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_DANGEROUS_CONTENT',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_HARASSMENT',
            threshold: 'BLOCK_NONE',
          },
          {
            category: 'HARM_CATEGORY_SEXUALLY_EXPLICIT',
            threshold: 'BLOCK_NONE',
          },
        ],
      },
      metrics: [
        MyAwesomeMetric.DELICIOUSNESS,
        MyAwesomeMetric.US_PHONE_REGEX_MATCH
      ],
    }),
  ],
  ...
});
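
Since every safety category above uses the same threshold, the list can also be generated rather than written out by hand. The sketch below is a hypothetical helper, not part of Genkit or the Google AI plugin; it only reproduces the four category and threshold pairs from the configuration above.

```typescript
// The four categories used in the configuration above.
const HARM_CATEGORIES = [
  'HARM_CATEGORY_HATE_SPEECH',
  'HARM_CATEGORY_DANGEROUS_CONTENT',
  'HARM_CATEGORY_HARASSMENT',
  'HARM_CATEGORY_SEXUALLY_EXPLICIT',
] as const;

type HarmCategory = (typeof HARM_CATEGORIES)[number];

interface SafetySetting {
  category: HarmCategory;
  threshold: 'BLOCK_NONE';
}

// Build one BLOCK_NONE entry per category.
function permissiveSafetySettings(): SafetySetting[] {
  return HARM_CATEGORIES.map((category) => ({
    category,
    threshold: 'BLOCK_NONE' as const,
  }));
}
```

The result could then be passed as `judgeConfig: { safetySettings: permissiveSafetySettings() }`, assuming the judge model accepts the same shape shown in the configuration above.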

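Testing

The same issues that apply to evaluating the output quality of a generative AI feature also apply to evaluating the judging capacity of an LLM-based evaluator.

To get a sense of whether the custom evaluator performs at the expected level, create a set of test cases that have clear right and wrong answers.

For deliciousness, that might look like a JSON file named deliciousness_dataset.json: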

[
  {
    "testCaseId": "delicious_mango",
    "input": "What is a super delicious fruit",
    "output": "A perfectly ripe mango – sweet, juicy, and with a hint of tropical sunshine."
  },
  {
    "testCaseId": "disgusting_soggy_cereal",
    "input": "What is something that is tasty when fresh but less tasty after some time?",
    "output": "Stale, flavorless cereal that's been sitting in the box too long."
  }
]

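These examples can be written by hand, or you can ask an LLM to help create a manageable set of test cases. Many available benchmark datasets can be used as well.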
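
Once the evaluator has been run over such a dataset, its verdicts can be compared with the answers you expected. The sketch below computes a simple agreement rate. The `LabeledResult` shape is hypothetical: it assumes you have paired each test case ID with a hand-assigned label and with the verdict the evaluator returned, and it does not describe any Genkit output format.

```typescript
type Label = 'yes' | 'no' | 'maybe';

// Hypothetical pairing of an expected label with the evaluator's verdict.
interface LabeledResult {
  testCaseId: string;
  expected: Label; // hand-assigned label
  actual: Label; // verdict returned by the evaluator
}

// Fraction of test cases where the evaluator agreed with the label, plus
// the IDs of the cases where it did not.
function agreement(results: LabeledResult[]): {
  rate: number;
  disagreements: string[];
} {
  if (results.length === 0) {
    throw new Error('No results to compare');
  }
  const disagreements = results
    .filter((r) => r.expected !== r.actual)
    .map((r) => r.testCaseId);
  return {
    rate: (results.length - disagreements.length) / results.length,
    disagreements,
  };
}
```

The list of disagreeing IDs is usually more useful than the rate itself, since it points at the cases where the prompt or its examples may need adjusting.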

Then use the Genkit CLI to run the evaluator against these test cases.

genkit eval:run deliciousness_dataset.json

View your results in the Genkit UI.

genkit start

Navigate to localhost:4000/evaluate.