How-To Guide#
This guide will walk you through the most useful aspects of DaggerML, including creating and managing DAGs, using the API, and running tests.
Install#
In a virtual environment, install daggerml-cli:
pip install daggerml-cli
In the environment where your code will run, install daggerml:
pip install daggerml
Initialize#
dml config user testi@testico
dml repo create my-repo
dml config repo my-repo
dml config branch main
dml status
CLI Usage#
dml --help
dml COMMAND --help
dml COMMAND SUBCOMMAND --help
Tip
Shell completion is available for bash/zsh via argcomplete.
Basic Usage#
Create a dag#
import daggerml as dml
print(f"{dml.__version__ = }")
dag = dml.new(name="dag_no_1", message="Example DAG creation")
dml.__version__ = '0.0.20'
Adding data to our dag#
We want to track some data flow, so let’s start with the simplest case possible: literals.
We can add a literal with dag.put, and get the value back with node.value().
node0 = dag.put(3)
node1 = dag.put("example")
node0.value(), node1.value()
(3, 'example')
daggerml has native support for collections like lists, sets, and maps (dictionaries).
node2 = dag.put({"node0": node0, "node1": node1, "misc": [None, False]})
node3 = dag.put([node0, node1, node2])
node3.value()
[3, 'example', {'misc': [None, False], 'node0': 3, 'node1': 'example'}]
node3[1:]
Node(node/2644feee4da9e5b5976fb9c0fced00f1)
Note that we passed both Python objects and dml.Node objects to dag.put, and the result was the same (from a data perspective). The difference is that when you pass a dml.Node object, daggerml can add the corresponding edge to the dag, so that dependency is tracked.
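To make the distinction concrete, here is a minimal pure-Python sketch of the idea. It does not use daggerml and is not its implementation: ToyNode, ToyDag, and the edge representation are hypothetical names invented for illustration. It only shows how passing a node handle instead of a plain value lets a store record a dependency while the resulting data stays the same.

```python
# Toy model only: this is NOT DaggerML's implementation or API.
from dataclasses import dataclass, field
from itertools import count


@dataclass(frozen=True)
class ToyNode:
    """A handle to a stored value (hypothetical stand-in for dml.Node)."""
    id: int
    data: object


@dataclass
class ToyDag:
    """Stores values and records an edge for every ToyNode found inside a put."""
    edges: list = field(default_factory=list)  # (dependency_id, new_node_id) pairs
    _ids: count = field(default_factory=count)

    def put(self, obj):
        deps = []
        data = self._strip(obj, deps)
        node = ToyNode(next(self._ids), data)
        self.edges.extend((dep.id, node.id) for dep in deps)
        return node

    def _strip(self, obj, deps):
        # Replace nodes by their data, remembering which nodes we saw.
        if isinstance(obj, ToyNode):
            deps.append(obj)
            return obj.data
        if isinstance(obj, list):
            return [self._strip(x, deps) for x in obj]
        if isinstance(obj, dict):
            return {k: self._strip(v, deps) for k, v in obj.items()}
        return obj


dag = ToyDag()
a = dag.put(3)
from_node = dag.put([a])     # passes a node: the edge a -> from_node is recorded
from_literal = dag.put([3])  # passes a plain int: same data, no edge

print(from_node.data == from_literal.data)  # True
print(dag.edges)                            # [(0, 1)]
```

Both puts store the list [3], but only the first one leaves a record that it was built from an existing node.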
Accessing data#
value#
Getting the value (as you saw above):
node3.value()
[3, 'example', {'misc': [None, False], 'node0': 3, 'node1': 'example'}]
Note
Calling dml.Node.value() on a collection unrolls the data recursively, returning only Python data structures and dml.Resource objects.
collections#
Collection nodes (like node2 and node3) should be treated like the collections they are.
Tip
Collection elements should be accessed via the methods described here when feasible.
You can index into lists and maps and you get a node back.
node2["node0"], node3[0]
(Node(node/dc53dffc84e002ee7d9558e7a4a542d8),
Node(node/e1aed53c8161928dc3c93a34afee0843))
You can also get the keys of a dictionary (as a node):
node2.keys().value()
['misc', 'node0', 'node1']
You can get the length of any collection:
node2.len().value(), node3.len().value()
(3, 3)
You can iterate over them (iterating over a map yields its keys):
[node.value() for node in node2]
['misc', 'node0', 'node1']
[node.value() for node in node3]
[3, 'example', {'misc': [None, False], 'node0': 3, 'node1': 'example'}]
You can call .items for dictionaries.
A key thing to keep in mind is that when you index, you get a different node than the one you put in (but with the same underlying value). For example,
node0 = dag.put(1)
node1 = dag.put([node0])
We see that node1 is just the list containing node0, so the values of node0 and node1[0] should be the same. But the node IDs are not:
node0, node1[0]
(Node(node/02073f1b7467cc013302f343e254b4d0),
Node(node/387d6e8c9a253a526d1834a4d98ee7f5))
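One way to think about this is that a node identifies a step in the dag rather than a bare value. The sketch below is a toy model in plain Python, not daggerml code: toy_id, ToyDag, and the "recipe" dictionaries are hypothetical, and whether DaggerML derives its node IDs this way is an assumption. It only illustrates how two nodes can share a value while having different IDs.

```python
# Toy model only: whether DaggerML derives node IDs this way is an assumption.
import hashlib
import json


def toy_id(recipe):
    """Hash a JSON-serializable description of how a node was produced."""
    blob = json.dumps(recipe, sort_keys=True).encode()
    return hashlib.sha256(blob).hexdigest()[:32]


class ToyDag:
    def __init__(self):
        self.values = {}  # node id -> plain Python value

    def put(self, value):
        node_id = toy_id({"op": "put", "value": value})
        self.values[node_id] = value
        return node_id

    def get(self, list_id, index):
        # Indexing is its own step, so it gets its own ID.
        node_id = toy_id({"op": "get", "from": list_id, "index": index})
        self.values[node_id] = self.values[list_id][index]
        return node_id


dag = ToyDag()
n0 = dag.put(1)
n1 = dag.put([1])      # the list containing the value of n0
n1_0 = dag.get(n1, 0)  # analogous to node1[0]

print(dag.values[n0] == dag.values[n1_0])  # True: same underlying value
print(n0 == n1_0)                          # False: different nodes
```

In practice this means you should compare nodes by value, not by ID, when you want to know whether two nodes hold the same data.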
Committing results#
Now that we have some nodes, we might want to commit a result. Let’s say the only purpose of this dag was to create this collection, so let’s commit it.
# we don't care about the return value yet.
_ = dag.commit(node3)
print(dag.dml("dag", "list"))
[{'nodes': ['17147b18bff6283c6836613f2e3d21b1', '6cc38c8cb58e9f0e69d7cb3d8949489b', 'e6bdbdd8282260b2cf7a65eee202656f', '5962d8c326b7da6ff464205175a21f43', '865e1e3671d77e4104a318df3d7e0170', 'e1aed53c8161928dc3c93a34afee0843', '5a38f243bdba144c16a7ba16e80c0782', '2644feee4da9e5b5976fb9c0fced00f1', '8cb93ece52acd62ab65f0017854bf462', '7dd0ef554f76b3872a76d598420bcef7', '6f8fd46484eab462e1fb251d439ff2e1', '5972a036dd24f3c7c54bcb5712002121', 'e9445d2d006ddecd870a70733e92d613', '08fe7a8fc32060653fa8908bb526be83', '6690a6189f2138f21eab2af71711b0a1', '0357707c7aeca1f6ca23392e9e479e89', '387d6e8c9a253a526d1834a4d98ee7f5', '9a945a3d20180bf6e5e9020aec584690', '401f8354bedb46df6979d04ed148a2cd', '3653b1339ddb76d6efbca0266f797b57', '251c79e0511ae74300ef325575d6dc7f', '9f5e704798b0161397c83f9d37ad369f', 'a2388fadfd456b8b591d0b3fcff6f1f2', 'd4c1bf3810c22bb45a9538940c05e976', '02073f1b7467cc013302f343e254b4d0', 'd2ec20d8a8afaad81ab198cca5455b25', 'f1f1b0661e77810d1d8772704ccb5ddd', 'dc53dffc84e002ee7d9558e7a4a542d8', '05f84d22fe91a6d0bafa1f5ec544b1bf', '3b5463d5eaa2d77ade2062a69bb0144e', '51f93385a2888d8051e76c22a696e8bc', '09d60c420e3afba85a715bbae23e24aa', '5993e4f56cf19d6b4ed2106eac0652c1', '615bddb065e02b6b55366597fb2b2247'], 'result': '401f8354bedb46df6979d04ed148a2cd', 'error': None, 'id': '3269e5b01517345290a44810d95c2079', 'name': 'dag_no_1'}]
Loading dag results#
When we committed the dag above, we made that value importable by any other dag. We added that dag to the working tree, and now others can see it.
dag = dml.new(name="dag_no_2", message="Second Dag")
node = dag.load("dag_no_1")
node.value()
[3, 'example', {'misc': [None, False], 'node0': 3, 'node1': 'example'}]
Real data#
In the real world we’re dealing with datasets on S3, or behind some data layer abstraction like Snowflake, Hive, or some bespoke software optimized for your company’s needs. We also deal with infrastructure that we spin up to test things. For these things daggerml has the concept of a dml.Resource. It’s a datum type like int, float, string, etc., but it represents a unique opaque blob.
rnode = dag.put(dml.Resource("my_ns:my_unique_id", data="asdf"))
resource = rnode.value()
resource
Resource(uri='my_ns:my_unique_id', data='asdf', adapter=None)
Exceptions#
To “fail” a dag, you just commit an instance of dml.Error. The dag's result is then a node that raises an error when you try to get its value.
dag = dml.new("failed-dag", "I'm doomed")
dag.commit(dml.Error("my unique error message"))
When we go to access it:
dag = dml.new("foopy", "gonna get an error")
node = dag.load("failed-dag")
node.value()
Error(message='my unique error message', context={}, code='Error')
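The pattern here is that a failure is stored as the dag's result and only surfaces when someone asks for the value. The following plain-Python sketch mimics that flow without daggerml. ToyRepo, ToyLoaded, and ToyError are hypothetical names, and the storage model is an assumption made for illustration.

```python
# Toy model only: ToyRepo, ToyLoaded and ToyError are hypothetical names.
class ToyError(Exception):
    """Stand-in for an error committed as a dag's result."""


class ToyLoaded:
    """What load() hands back: a handle whose value is checked lazily."""

    def __init__(self, result):
        self._result = result

    def value(self):
        if isinstance(self._result, Exception):
            raise self._result  # the failure surfaces only on access
        return self._result


class ToyRepo:
    def __init__(self):
        self._results = {}  # dag name -> committed result

    def commit(self, dag_name, result):
        self._results[dag_name] = result

    def load(self, dag_name):
        return ToyLoaded(self._results[dag_name])


repo = ToyRepo()
repo.commit("good-dag", [3, "example"])
repo.commit("failed-dag", ToyError("my unique error message"))

print(repo.load("good-dag").value())  # [3, 'example']

node = repo.load("failed-dag")        # loading itself does not raise
try:
    node.value()
except ToyError as err:
    print(f"raised: {err}")           # raised: my unique error message
```

Deferring the error to the point of access lets a downstream dag decide whether and where to handle an upstream failure.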
Dags as context managers#
Tip
You can use a dag as a context manager to fail dags when an exception is thrown.
with dml.new("failed-dag2", "I'm doomed") as dag:
    dag.put(1/0)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
Cell In[18], line 2
1 with dml.new("failed-dag2", "I'm doomed") as dag:
----> 2 dag.put(1/0)
ZeroDivisionError: division by zero
The error was re-raised, but the dag still failed.
dag = dml.new("foopy", "gonna get an error")
node = dag.load("failed-dag2")
node.value()
Error(message='division by zero', context={'trace': ['Traceback (most recent call last):\n', ' File "/tmp/ipykernel_2174/3633103110.py", line 2, in <module>\n dag.put(1/0)\n ~^~\n', 'ZeroDivisionError: division by zero\n']}, code='ZeroDivisionError')
The context manager also kept the stack trace, so the trace is now stored in daggerml.
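The behavior described in this section (record the failure, keep the traceback, then re-raise) can be sketched with an ordinary Python context manager. This is a minimal illustration using only the standard library. ToyDag and its result dictionary are hypothetical and merely imitate the shape of the Error shown above; this is not how daggerml is implemented.

```python
import traceback


class ToyDag:
    """Hypothetical stand-in for a dag; not DaggerML's implementation."""

    def __init__(self, name):
        self.name = name
        self.result = None

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc, tb):
        if exc_type is not None:
            # Record the failure in a shape similar to the Error shown above.
            self.result = {
                "message": str(exc),
                "context": {"trace": traceback.format_exception(exc_type, exc, tb)},
                "code": exc_type.__name__,
            }
        return False  # falsy return value: Python re-raises the exception


dag = ToyDag("failed-dag2")
try:
    with dag:
        1 / 0
except ZeroDivisionError:
    print("re-raised")

print(dag.result["code"], "|", dag.result["message"])
# ZeroDivisionError | division by zero
```

Returning a falsy value from `__exit__` is what keeps the exception visible to the caller while the failure has already been recorded.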
Using the API Class#
The Api class provides methods to interact with DAGs. Here is an example of how to use it:
with dml.Dml() as api:
    dag = api.new(name="example_dag", message="Example DAG creation")
import json
with dml.Dml() as api:
    print(json.dumps(api("status"), indent=2))
{
"repo": "test",
"branch": "main",
"user": "test",
"config_dir": "/tmp/tmp34qn9uqn",
"project_dir": "/tmp/tmpast7vyyb",
"repo_path": "/tmp/tmp34qn9uqn/repo/test"
}