AI2 Introduces Tango, A Python Library For Choreographing Machine Learning Research Experiments By Executing A Series Of Steps

On Jul 2, 2022

Active research projects frequently devolve into a jumble of files with more or less descriptive names, processed by Python programs and held together by Bash scripts. Because intermediate results disappear or become hard to locate, people can never be entirely sure that they can actually reproduce a result.

Tango ensures you never operate on outdated data by taking care of your intermediate and final outcomes and finding them again when needed.

What does that actually mean?

Tango has a lot of capabilities, but its main feature is this:

Tango caches function results even if your process is restarted. Even if you use it for just one function, Tango can benefit you significantly.

Perhaps your codebase has a function that trains a model:

def train_model(
    model_name: str,
    dataset_name: str,
    lr: float = 1e-5,
    ...
) -> torch.nn.Module:
    ... # runs for six hours
    return model
# later ...
model = train_model(model_name="gpt2", dataset_name="squad_v2", ...)

Using Tango, you would wrap that function in a Step:

from tango import Step

class TrainModelStep(Step):
    def run(
        self,
        model_name: str,
        dataset_name: str,
        lr: float = 1e-5,
        ...
    ) -> torch.nn.Module:
        ... # runs for six hours
        return model

# later ...
model_step = TrainModelStep(
    model_name="gpt2",
    dataset_name="squad_v2",
    ...)
model = model_step.result()

With this setup, the model is trained the first time you call model_step.result(). The result is cached in memory, so the second call returns the model immediately.

On its own, that is not very helpful. It becomes useful when the cache is stored on disk rather than in memory. To do that, you use another Tango concept, the Workspace:

workspace = tango.Workspace.from_url("path/to/workspace")
model_step = TrainModelStep(model_name="gpt2", dataset_name="squad_v2", ...)
model = model_step.results(workspace)

Now, it will train the model the first time you run this. The second time you run this, it will find the model in the workspace and read it from there.
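
To make the idea of a disk-backed cache concrete, here is a minimal sketch using only the standard library. It is not how Tango implements its workspace; the names cached_call and expensive_training are hypothetical and exist only for this illustration.

```python
import pickle
import tempfile
from pathlib import Path


def cached_call(cache_dir: Path, key: str, fn, *args, **kwargs):
    """Return fn(*args, **kwargs), reusing a result stored under `key` on disk.

    Hypothetical helper for illustration; not part of Tango.
    """
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{key}.pkl"
    if path.exists():
        # Cache hit: read the stored result instead of recomputing it.
        with path.open("rb") as f:
            return pickle.load(f)
    # Cache miss: compute the result and persist it for later processes.
    result = fn(*args, **kwargs)
    with path.open("wb") as f:
        pickle.dump(result, f)
    return result


calls = []


def expensive_training(model_name: str) -> dict:
    """Stand-in for a six-hour training run."""
    calls.append(model_name)  # record every real execution
    return {"model": model_name, "weights": [0.1, 0.2]}


workspace_dir = Path(tempfile.mkdtemp())
first = cached_call(workspace_dir, "train-gpt2", expensive_training, "gpt2")
second = cached_call(workspace_dir, "train-gpt2", expensive_training, "gpt2")
```

Because the result lives in a file, the second call would find it even after a restart of the Python process, which is the property the in-memory cache lacks.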


Why is this cool?

With this capability at your disposal, you can change how you write your experiments. Right now, you are presumably chaining several stages together with Bash scripts:

One Python command to train the model.

Another Python command to fine-tune it.

A third to evaluate the result.

With Tango, the whole chain can be written in Python instead:

pretrained_model_step = TrainModelStep(
    model="gpt2", dataset="squad_v2", ...)
pretrained_model = pretrained_model_step.result(workspace)
finetuned_model_step = TrainModelStep(
    model=pretrained_model, dataset="drop", ...)
finetuned_model = finetuned_model_step.result(workspace)
evaluation_step = EvaluateModelStep(
    model=finetuned_model, dataset="mrqa")
evaluation = evaluation_step.result(workspace)

Tango automatically skips the parts that are already in the cache. You can iterate on your pipeline and be sure you never accidentally use outdated data. You never lose intermediate results, and you don't need a complicated naming scheme to keep them organized.

However, writing all those additional .result() lines is tedious, so Tango has a built-in shortcut. You can pass one step directly as an argument to the next, and Tango takes care of the rest:

pretrained_model_step = TrainModelStep(
    model="gpt2", dataset="squad_v2", ...)
finetuned_model_step = TrainModelStep(
    model=pretrained_model_step, dataset="drop", ...)
evaluation_step = EvaluateModelStep(
    model=finetuned_model_step, dataset="mrqa")
evaluation = evaluation_step.result(workspace)
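
One way to picture the shortcut is that a step resolves any argument that is itself a step before it runs. The toy classes below (MiniStep, Train, Evaluate) are hypothetical stand-ins written for this sketch, not Tango's Step class, and they cache only in memory.

```python
class MiniStep:
    """Toy stand-in for a step; hypothetical, not Tango's Step class."""

    def __init__(self, **kwargs):
        self.kwargs = kwargs
        self._cached = None
        self._done = False

    def run(self, **kwargs):
        raise NotImplementedError

    def result(self):
        if not self._done:
            # Resolve every argument that is itself a step before running.
            resolved = {
                name: (value.result() if isinstance(value, MiniStep) else value)
                for name, value in self.kwargs.items()
            }
            self._cached = self.run(**resolved)
            self._done = True
        return self._cached


class Train(MiniStep):
    def run(self, model, dataset):
        # Pretend that "training" just records what was combined.
        return f"{model}+{dataset}"


class Evaluate(MiniStep):
    def run(self, model, dataset):
        return f"eval({model}) on {dataset}"


pretrained = Train(model="gpt2", dataset="squad_v2")
finetuned = Train(model=pretrained, dataset="drop")
evaluation = Evaluate(model=finetuned, dataset="mrqa").result()
```

Only the final result() call is written out; the upstream steps run as a side effect of resolving the arguments.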

How does Tango decide when to run a step?

Tango stores a step's results under a unique name derived from the step's name and the arguments of the call. The names look like this: TrainModelStep-34AiXoyiPKADMUnhcBzFYd6JeMcgx4DP. If the generated name already exists in the workspace, Tango reads the results from there. If it does not exist, Tango runs the step. When the arguments you pass change, the name changes too, and the step runs again.
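
The naming idea can be sketched with a hash over the arguments. This is an assumption-laden illustration: the function unique_name, the use of JSON and SHA-256, and the hex digest format are choices made for this example and will not reproduce Tango's actual names.

```python
import hashlib
import json


def unique_name(step_name: str, arguments: dict) -> str:
    """Combine a step name with a digest of its arguments (hypothetical helper)."""
    # Sorting keys makes the digest independent of argument order.
    payload = json.dumps(arguments, sort_keys=True).encode("utf-8")
    digest = hashlib.sha256(payload).hexdigest()[:32]
    return f"{step_name}-{digest}"


a = unique_name("TrainModelStep", {"model_name": "gpt2", "lr": 1e-5})
b = unique_name("TrainModelStep", {"lr": 1e-5, "model_name": "gpt2"})
c = unique_name("TrainModelStep", {"model_name": "gpt2", "lr": 3e-5})
```

Here a and b are the same name because the arguments are the same, while c differs because the learning rate changed, so a cache keyed on these names would rerun the step only for c.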

Tango is intended to be used in a dynamic environment where any component of an experiment, or any piece of code, might change at any time. How does Tango know that a step must be re-run when the code for the step has changed? The answer: you must tell it, by giving the step a version. Bump the version whenever you make a significant code change:

class TrainModelStep(Step):
    VERSION = "002"  # bump this after significant changes to run()
    def run(
        self,
        model_name: str,
        dataset_name: str,
        lr: float = 1e-5,
        ...
    ) -> torch.nn.Module:
        ... # runs for six hours
        return model
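
A plausible way to think about the version is as one more input to the step's unique name. The sketch below assumes exactly that; versioned_name is a hypothetical helper, and the hashing scheme is an illustration rather than Tango's own.

```python
import hashlib
import json


def versioned_name(step_name: str, version: str, arguments: dict) -> str:
    """Digest the version together with the arguments (hypothetical helper)."""
    payload = json.dumps(
        {"version": version, "arguments": arguments}, sort_keys=True
    ).encode("utf-8")
    return f"{step_name}-{hashlib.sha256(payload).hexdigest()[:32]}"


arguments = {"model_name": "gpt2", "dataset_name": "squad_v2"}
old_name = versioned_name("TrainModelStep", "001", arguments)
new_name = versioned_name("TrainModelStep", "002", arguments)
```

With identical arguments, bumping the version yields a different name, so a cache keyed on the name would treat the step as not yet run.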

How does Tango know how to serialize the result of a step?

By default, Tango serializes and deserializes results with the well-known dill library. That is not the best choice for many results, such as PyTorch models or massive datasets, so Tango lets you change the serialization format. For a PyTorch model, you might want to use torch's own serialization:

from tango.integrations.torch import TorchFormat

class TrainModelStep(Step):
    FORMAT = TorchFormat()  # use torch's serialization instead of dill
    def run(
        self,
        model_name: str,
        dataset_name: str,
        lr: float = 1e-5,
        ...
    ) -> torch.nn.Module:
        ... # runs for six hours
        return model

Tango takes extra care to serialize and deserialize iterators efficiently. If your run() method uses yield instead of return, Tango detects this and handles it during serialization. In that case it makes sure the result is never read in full, which saves memory and reduces latency.
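
The general technique of streaming an iterator to disk and reading it back lazily can be sketched with pickle alone. The helpers write_items and read_items are hypothetical names for this example, and the file layout is an assumption, not Tango's on-disk format.

```python
import pickle
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def write_items(path: Path, items: Iterable) -> None:
    """Pickle items one at a time so the full sequence is never held in memory."""
    with path.open("wb") as f:
        for item in items:
            pickle.dump(item, f)


def read_items(path: Path) -> Iterator:
    """Lazily yield items back, one pickle record per step."""
    with path.open("rb") as f:
        while True:
            try:
                yield pickle.load(f)
            except EOFError:
                # End of file: every record has been read.
                return


def generate_examples():
    """Stand-in for a run() method that uses yield."""
    for i in range(5):
        yield {"id": i, "text": f"example {i}"}


path = Path(tempfile.mkdtemp()) / "examples.pkl"
write_items(path, generate_examples())

reader = read_items(path)
first = next(reader)  # only the first record has been read so far
rest = list(reader)   # consume the remaining records
```

A consumer that needs only the first few items pays only for those items, which is where the memory and latency savings come from.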

There is even a custom format, SqliteSequenceFormat, for run() methods that return lists of items. It avoids reading the complete list into memory on deserialization. As with the special handling of iterators, this saves memory and reduces latency.
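
To show why a SQLite-backed list helps, here is a small sketch of a sequence whose items live in a database file and are loaded one at a time. SqliteBackedList is a hypothetical class written for this example; its table layout is an assumption and is unrelated to what SqliteSequenceFormat actually stores.

```python
import pickle
import sqlite3
import tempfile
from collections.abc import Sequence
from pathlib import Path


class SqliteBackedList(Sequence):
    """Read-mostly list stored in SQLite (hypothetical, for illustration)."""

    def __init__(self, db_path: Path):
        self._conn = sqlite3.connect(str(db_path))
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS items (idx INTEGER PRIMARY KEY, value BLOB)"
        )

    def extend(self, values) -> None:
        start = len(self)
        # Rows are inserted as they are produced; no intermediate list is built.
        self._conn.executemany(
            "INSERT INTO items (idx, value) VALUES (?, ?)",
            ((start + i, pickle.dumps(v)) for i, v in enumerate(values)),
        )
        self._conn.commit()

    def __len__(self) -> int:
        return self._conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported in this sketch")
        if index < 0:
            index += len(self)
        row = self._conn.execute(
            "SELECT value FROM items WHERE idx = ?", (index,)
        ).fetchone()
        if row is None:
            raise IndexError(index)
        # Only the requested item is unpickled.
        return pickle.loads(row[0])


items = SqliteBackedList(Path(tempfile.mkdtemp()) / "items.sqlite")
items.extend(f"row {i}" for i in range(1000))
middle = items[500]
```

Opening the list costs almost nothing, and reading items[500] touches a single row rather than the whole collection.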

Source: https://blog.allenai.org/keeping-experiments-straight-with-tango-bbe856ac01eb

Regarding Workspaces

Recall this line of code:

workspace = tango.Workspace.from_url("path/to/workspace")

Storing a workspace on a local disk is already useful. Better still, you can put it on an NFS path and access the same workspace from several machines.

It gets better: if you use Weights & Biases, you can use WandbWorkspace:

workspace = tango.Workspace.from_url("wandb://<username>/<project>")

As a result, any computer with access to W&B can read your results and cache.
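
The two from_url calls above differ only in the URL, which suggests that the scheme selects the kind of workspace. The sketch below illustrates that reading with the standard library; describe_workspace_url is a hypothetical function, the dispatch rules are an assumption rather than Tango's actual logic, and "alice" and "demo-project" are placeholder values.

```python
from urllib.parse import urlparse


def describe_workspace_url(url: str) -> dict:
    """Classify a workspace URL by its scheme (hypothetical helper)."""
    parsed = urlparse(url)
    if parsed.scheme == "":
        # No scheme: treat the string as a path on the local file system.
        return {"kind": "local", "location": url}
    if parsed.scheme == "wandb":
        # wandb://<username>/<project>
        return {
            "kind": "wandb",
            "entity": parsed.netloc,
            "project": parsed.path.lstrip("/"),
        }
    raise ValueError(f"unsupported workspace scheme: {parsed.scheme}")


local = describe_workspace_url("path/to/workspace")
remote = describe_workspace_url("wandb://alice/demo-project")
```

Keeping the choice of backend in a single string means the experiment code stays the same whether results are cached locally or shared through W&B.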

This article was written as a summary by Marktechpost Staff based on the article 'Keeping experiments straight with Tango'. All credit for this research goes to the researchers on this project.
