diff --git a/docs/source/augmentation_tutorials/adversarial.rst b/docs/source/augmentation_tutorials/adversarial.rst
new file mode 100644
index 00000000..006d7677
--- /dev/null
+++ b/docs/source/augmentation_tutorials/adversarial.rst
@@ -0,0 +1,47 @@
+.. _adversarial_human_like_augmentation:
+
+Adversarial human-like augmentation
+====================================
+
+This tutorial covers :py:class:`autointent.generation.utterances.HumanUtteranceGenerator` together with :py:class:`autointent.generation.utterances.CriticHumanLike`. The generator proposes paraphrases of training utterances; the critic asks an LLM to label each candidate as ``human`` or ``generated``. Candidates classified as ``generated`` are rejected and refined in a loop until the critic accepts them (or retries are exhausted).
+
+.. warning::
+
+    This path is **experimental** and may hurt data quality if the critic or base model mis-judges natural text. Use small ``n_final_per_class`` values first and inspect outputs.
+
+How it fits together
+--------------------
+
+- **Generator** — :py:class:`autointent.generation.Generator` wraps your chat/structured-output API (OpenAI-compatible).
+- **CriticHumanLike** — builds a JSON-schema prompt so the LLM returns ``reasoning`` and ``label`` (``human`` \| ``generated``); :py:meth:`~autointent.generation.utterances.CriticHumanLike.is_human` returns whether the utterance passed.
+- **HumanUtteranceGenerator** — orchestrates rewrite attempts per intent; :py:meth:`~autointent.generation.utterances.HumanUtteranceGenerator.augment` can append accepted samples back into a chosen split (default: train).
+
+Installation
+------------
+
+Install the OpenAI-backed generator extra (the ``Generator`` wrapper loads the OpenAI client):
+
+.. code-block:: bash
+
+    pip install "autointent[openai]"
+
+Set ``OPENAI_API_KEY`` (and optional base URL) as required by your deployment. No separate DSPy extra is needed for this augmentation path.
+
+Minimal sketch
+--------------
+
+.. code-block:: python
+
+    from autointent import Dataset
+    from autointent.generation import Generator
+    from autointent.generation.utterances import CriticHumanLike, HumanUtteranceGenerator
+
+    dataset = Dataset.from_dict({...})  # your train split, with intent names if you use them in prompts
+
+    llm = Generator(model_name="gpt-4o-mini")
+    critic = CriticHumanLike(generator=llm)
+    augmenter = HumanUtteranceGenerator(generator=llm, critic=critic, async_mode=False)
+
+    new_samples = augmenter.augment(dataset, split_name="train", n_final_per_class=3)
+
+See the API reference for full argument lists (:py:class:`~autointent.generation.utterances.HumanUtteranceGenerator`, :py:class:`~autointent.generation.utterances.CriticHumanLike`).
diff --git a/docs/source/augmentation_tutorials/index.rst b/docs/source/augmentation_tutorials/index.rst
index 38280bda..dcc3fea7 100644
--- a/docs/source/augmentation_tutorials/index.rst
+++ b/docs/source/augmentation_tutorials/index.rst
@@ -8,4 +8,5 @@ Data augmentation tutorials

     balancer
     dspy_augmentation
+    adversarial
     intent_description
diff --git a/docs/source/concepts.rst b/docs/source/concepts.rst
index 2e4816a3..caa99734 100644
--- a/docs/source/concepts.rst
+++ b/docs/source/concepts.rst
@@ -85,6 +85,9 @@ A critical capability for production text classification systems, especially in
 **🔗 Integration with Multi-Label**
     OOS detection works seamlessly with multi-label scenarios, enabling detection of completely unknown inputs vs. partial matches to known classes.

+**🧭 Split handling**
+    When splits contain OOS samples (``label is None``), the data handler keeps scoring stages on in-domain rows only: in hold-out mode it can duplicate affected splits into ``{split}_0`` (OOS removed for scoring) and ``{split}_1`` (full data for decision) when ``separation_ratio`` is not configured, and cross-validation similarly drops OOS from training folds used while scoring. Before fitting, you can validate whether your data supports splitting with :py:func:`autointent.context.data_handler.check_split_readiness`.
+
 .. _concepts-presets:

 Optimization Presets
diff --git a/user_guides/basic_usage/03_automl.py b/user_guides/basic_usage/03_automl.py
index 909a5d4d..29a3f5ac 100644
--- a/user_guides/basic_usage/03_automl.py
+++ b/user_guides/basic_usage/03_automl.py
@@ -52,6 +52,8 @@

 # %% [markdown]
 """
+The same preset can also be loaded as a typed %mddoclink(class,,OptimizationConfig) via ``OptimizationConfig.from_preset("classic-light")`` and passed to %mddoclink(method,Pipeline,from_optimization_config) when you want a validated configuration object instead of editing the raw dict from ``load_preset``.
+
 You can inspect the structure and default values of any preset:
 """

@@ -77,7 +79,7 @@

 # %% [markdown]
 """
-See tutorial %mddoclink(notebook,advanced.03_search_space_configuration) on how the search space is structured.
+See tutorial %mddoclink(notebook,advanced.03_automl) on how the search space is structured.
 """

 # %% [markdown]
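
Note on the loop described in ``adversarial.rst``: the tutorial text says candidates labeled ``generated`` are rejected and refined until the critic accepts them or retries are exhausted. The control flow can be sketched in plain Python as below. This is an illustrative sketch only, not autointent's implementation: ``refine_until_accepted``, ``toy_propose``, ``toy_is_human`` and ``max_attempts`` are hypothetical names, and the toy callables merely stand in for the LLM-backed generator and critic.

```python
from typing import Callable, Optional


def refine_until_accepted(
    seed: str,
    propose: Callable[[str], str],
    is_human: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[str]:
    """Return the first candidate the critic accepts, or None if attempts run out.

    Hypothetical helper: `propose` plays the role of the generator and
    `is_human` the role of the critic's human/generated verdict.
    """
    candidate = seed
    for _ in range(max_attempts):
        # Ask the generator for a new version of the current text.
        candidate = propose(candidate)
        # Accept as soon as the critic judges the candidate human-like.
        if is_human(candidate):
            return candidate
        # Otherwise the rejected candidate is fed back in for refinement.
    return None


# Toy stand-ins so the sketch runs without an LLM (assumptions, not library code).
def toy_propose(text: str) -> str:
    # "Refine" by dropping the first word.
    return " ".join(text.split()[1:])


def toy_is_human(text: str) -> bool:
    # Pretend short utterances read as human-written.
    return len(text.split()) <= 3


result = refine_until_accepted("kindly please book a table", toy_propose, toy_is_human)
print(result)  # -> book a table
```

Returning ``None`` on exhaustion keeps rejected text out of the augmented split, which matches the tutorial's advice to start with small ``n_final_per_class`` values and inspect outputs.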