The price of a car
In this example we consider a simplified price calculation for a car. The example illustrates some basic Hubit features and shows three different ways of implementing the car price calculation. To a large extent, the example follows examples/car. The example will be explained and some key Hubit terminology will be introduced.
In the example let us imagine that we are calculating the price of a car based on the names of the individual parts. So the calculation involves a lookup of the price for each part and a summation of the parts prices.
Components
First, your existing tools each need to be wrapped as a Hubit component. A Hubit component is a computational task that has bindings to the input data and to the results data. The bindings define which attributes the component
- consumes from the shared input data structure,
- consumes from the shared results data structure, and
- provides to the shared results data structure.
From the bindings, Hubit can check that all required input data and results data are available before the computational task is executed. The bindings are defined in a model configuration file and are passed to the component entrypoint function.
Component entrypoint function
Below you can see some pseudo code for the car price calculation. The example is available in mod1_cmp1.py
def price(_input_consumed, results_provided):
    # Extract required input data here
    counts = _input_consumed['part_counts']
    names = _input_consumed['part_names']

    # Look up the price of the part based on the part name (local data, database)
    unit_prices = [my_lookup(name) for name in names]

    # Compute results here (web service, C, Python ...)
    result = sum([count * unit_price
                  for count, unit_price in zip(counts, unit_prices)])
    results_provided['car_price'] = result
The entrypoint function in a component (price in the example above) should expect the arguments _input_consumed and results_provided in that order. Results data calculated in the component should only be added to the latter. The values stored in the keys part_counts and part_names in _input_consumed are controlled by the bindings in the model configuration file.
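To make the pseudo code above concrete, the sketch below calls the entrypoint by hand, without Hubit. The UNIT_PRICES table and the body of my_lookup are assumptions made for illustration; the prices are chosen to agree with the responses shown in the Running section and are not taken from the Hubit example files.

```python
# Hypothetical unit prices (assumption), consistent with the responses
# shown in the "Running" section, e.g. 4 x wheel1 = 480.0.
UNIT_PRICES = {
    "wheel1": 120.0,
    "chassis1": 1234.0,
    "bumper": 89.0,
    "engine1": 2343.0,
    "radio": 45.0,
}


def my_lookup(name):
    """Stand-in for a lookup in local data or a database."""
    return UNIT_PRICES[name]


def price(_input_consumed, results_provided):
    counts = _input_consumed["part_counts"]
    names = _input_consumed["part_names"]
    unit_prices = [my_lookup(name) for name in names]
    results_provided["car_price"] = sum(
        count * unit_price for count, unit_price in zip(counts, unit_prices)
    )


# Call the entrypoint by hand with the data for the car at index 0
_input_consumed = {
    "part_counts": [4, 1, 2, 1, 1],
    "part_names": ["wheel1", "chassis1", "bumper", "engine1", "radio"],
}
results_provided = {}
price(_input_consumed, results_provided)
print(results_provided)
```

In a real model Hubit builds _input_consumed from the bindings and collects results_provided after the call, so the function body never needs to know where the data came from.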
Component bindings
Before we look at the bindings, let us look at the input data. The input can, as in the example below, be defined in a YAML file. In the car example the input data contains a list of two cars, each with a number of parts
cars:
  - parts:
      - count: 4
        name: wheel1
      - count: 1
        name: chassis1
      - count: 2
        name: bumper
      - count: 1
        name: engine1
      - count: 1
        name: radio
  - parts:
      - count: 4
        name: wheel2
      - count: 1
        name: chassis2
      - count: 2
        name: bumper
      - count: 1
        name: engine14
The price entrypoint function above expects a list of part names stored in the field part_names and a list of the corresponding part counts stored in the field part_counts. To make such lists available for the entrypoint function, the model configuration file should contain the lines below.
consumes_input:
  - name: part_names # key in _input_consumed exposed to the entrypoint function
    path: cars[IDX_CAR].parts[:@IDX_PART].name # path in input data
  - name: part_counts
    path: cars[IDX_CAR].parts[:@IDX_PART].count
The strings in square brackets are called index specifiers. The index specifier :@IDX_PART refers to all items for the index identifier IDX_PART. The index specifier IDX_CAR simply refers to a specific car. The index identifiers (here IDX_PART and IDX_CAR) are identification strings that the user can choose, with some limitations.
With the input data and bindings shown above, the content of _input_consumed in the price function for the car at index 1 will be
{
'part_counts': [4, 1, 2, 1],
'part_names': ['wheel2', 'chassis2', 'bumper', 'engine14']
}
i.e. the component's entrypoint function will have all counts and part names for a single car (in this case the car at index 1) available in the input.
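The effect of the two consumes_input bindings can be mimicked in plain Python. The sketch below is not how Hubit resolves paths internally; consume_for_car is a hypothetical helper that hard-codes the two paths for one value of IDX_CAR, using the input data from the YAML example.

```python
input_data = {
    "cars": [
        {
            "parts": [
                {"count": 4, "name": "wheel1"},
                {"count": 1, "name": "chassis1"},
                {"count": 2, "name": "bumper"},
                {"count": 1, "name": "engine1"},
                {"count": 1, "name": "radio"},
            ]
        },
        {
            "parts": [
                {"count": 4, "name": "wheel2"},
                {"count": 1, "name": "chassis2"},
                {"count": 2, "name": "bumper"},
                {"count": 1, "name": "engine14"},
            ]
        },
    ]
}


def consume_for_car(input_data, idx_car):
    """Hypothetical helper mimicking the bindings for one value of IDX_CAR."""
    parts = input_data["cars"][idx_car]["parts"]
    return {
        # cars[IDX_CAR].parts[:@IDX_PART].count -> all part counts for the car
        "part_counts": [part["count"] for part in parts],
        # cars[IDX_CAR].parts[:@IDX_PART].name -> all part names for the car
        "part_names": [part["name"] for part in parts],
    }


consumed = consume_for_car(input_data, 1)
print(consumed)
```

The fixed index (IDX_CAR) selects one car, while the :@IDX_PART specifier collects the attribute over every part of that car into a list.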
In the last line of the price function, the car price is added to the results
results_provided['car_price'] = result
To enable the transfer of the calculated car price to the correct path in the shared results data object we must add a binding for the name car_price. If, for example, we want to store the car price in a field called price at the same car index as where the input data was taken from, the binding below should be added to the model file.
provides_results:
  - name: car_price # internal name (key) in results_provided
    path: cars[IDX_CAR].price # path in the shared results data
The index specifier IDX_CAR in the binding path tells Hubit to store the car price at the same car index as where the input was taken from. Note that the component itself is unaware of which car (car index) the input represents.
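The division of labour between the component and the framework can be sketched in plain Python. The loop below stands in for Hubit and is an assumption made for illustration only; the unit prices are hypothetical values chosen to agree with the responses in the Running section.

```python
# Hypothetical unit prices (assumption), consistent with the "Running" section
UNIT_PRICES = {
    "wheel1": 120.0, "chassis1": 1234.0, "bumper": 89.0,
    "engine1": 2343.0, "radio": 45.0,
    "wheel2": 78.0, "chassis2": 1120.0, "engine14": 3400.0,
}


def price(_input_consumed, results_provided):
    # The component only sees counts and names, never the car index
    results_provided["car_price"] = sum(
        count * UNIT_PRICES[name]
        for count, name in zip(
            _input_consumed["part_counts"], _input_consumed["part_names"]
        )
    )


# What the consumes_input bindings would deliver for each value of IDX_CAR
consumed_per_car = [
    {
        "part_counts": [4, 1, 2, 1, 1],
        "part_names": ["wheel1", "chassis1", "bumper", "engine1", "radio"],
    },
    {
        "part_counts": [4, 1, 2, 1],
        "part_names": ["wheel2", "chassis2", "bumper", "engine14"],
    },
]

# Stand-in for the framework: it knows the index, the component does not
results = {"cars": [{} for _ in consumed_per_car]}
for idx_car, consumed in enumerate(consumed_per_car):
    provided = {}
    price(consumed, provided)
    # provides_results binding: car_price -> cars[IDX_CAR].price
    results["cars"][idx_car]["price"] = provided["car_price"]

print(results)
```

Because the index bookkeeping lives outside the component, the same price function can serve any number of cars without modification.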
Collecting the bindings we get
provides_results:
  - name: car_price # internal name in the component
    path: cars[IDX_CAR].price # path in the shared data
consumes_input:
  - name: part_names
    path: cars[IDX_CAR].parts[:@IDX_PART].name
  - name: part_counts
    path: cars[IDX_CAR].parts[:@IDX_PART].count
Read more about paths, index specifiers and index identifiers in the documentation.
Tips on refactoring
The flexibility of the Hubit binding paths allows you to match the interfaces of your existing tools. Further, this flexibility enables you to refactor to get good modularity and optimize for speed when multi-processing is used. Below we will show three versions of the car model and outline some key differences when multi-processing is used.
Car model 0
In model0.yml the price calculation receives an entire car object at a specific car index (IDX_CAR). This allows the component to store results data on the corresponding car index in the results data object that Hubit creates.
provides_results:
  - name: car_price
    path: cars[IDX_CAR].price
consumes_input:
  - name: car
    path: cars[IDX_CAR]
This model allows queries such as cars[:].price and cars[1].price. If the car objects in the input data only contain count and name for each part (like in the example above), this simple model definition is more or less equivalent to the more elaborate model shown above. If, on the other hand, the car objects in the input data contain more data, this (irrelevant) data would be exposed to the price calculation function. Further, in the implementation of the car price calculation an undesirable tight coupling to the input data structure would be unavoidable. The entrypoint function could look something like this
counts, names = list(
    zip(
        *[(part["count"], part["name"]) for part in _input_consumed["car"]["parts"]]
    )
)
unit_prices = [my_lookup_function(name) for name in names]
result = sum([count * unit_price for count, unit_price in zip(counts, unit_prices)])
results_provided["car_price"] = result
Notice how the parts list and the count and name attributes are accessed directly on the car object, leading to a tight coupling.
Car model 1
model1.yml is the model described above, before model 0, where the car price is calculated in a single component, i.e. in a single worker process. Such an approach works well if the lookup of parts prices is fast and the car price calculation is also fast. If, however, the lookup is fast while the car price calculation is slow, and we imagine that another component is also consuming the parts prices, then the car price calculation would be a bottleneck. In such cases, separating the lookup from the price calculation would probably boost performance. Models 2 and 3 present two different ways of implementing such a separation.
Car model 2
In model2.yml the parts price lookup and the car price calculation are implemented in two separate components. Further, the component that is responsible for the price lookup retrieves the price for one part only. In other words, each lookup will happen in a separate (optionally asynchronous) worker process. When all the lookup processes are done, the price component sums the parts prices to get the total car price. The relevant sections of the model file could look like this
# price for one part
- consumes_input:
    - name: part_name
      path: cars[IDX_CAR].parts[IDX_PART].name
    - name: part_count
      path: cars[IDX_CAR].parts[IDX_PART].count
  provides_results:
    - name: part_price
      path: cars[IDX_CAR].parts[IDX_PART].price
# car price from parts prices
- consumes_results:
    - name: prices
      path: cars[IDX_CAR].parts[:@IDX_PART].price
  provides_results:
    - name: car_price
      path: cars[IDX_CAR].price
Notice that the first component consumes a specific part index (IDX_PART) for a specific car index (IDX_CAR). This allows the component to store results data on a specific part index for a specific car index. The entrypoint function for the first component (price for one part) could look something like this
def part_price(_input_consumed, results_provided):
    count = _input_consumed['part_count']
    name = _input_consumed['part_name']
    results_provided['part_price'] = count * my_lookup_function(name)
The entrypoint function for the second component (car price) could look like this
def car_price(_input_consumed, results_provided):
    results_provided['car_price'] = sum(_input_consumed['prices'])
In this refactored model Hubit will, when submitting a query for the car price using the multi-processor flag, execute each part price calculation in a separate asynchronous worker process. If the part price lookup is fast, the overhead introduced by multi-processing may render model 2 less attractive. In such cases performing all the lookups in a single component, but still keeping the lookup separate from the car price calculation, as shown in car model 3, could be a good solution.
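The two-stage flow of model 2 can be imitated without Hubit. In the sketch below a thread pool from the standard library stands in for Hubit's worker processes, which is an assumption made purely for illustration; run_part_worker is a hypothetical helper and the unit prices are hypothetical values consistent with the Running section.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical unit prices (assumption) for the car at index 0
UNIT_PRICES = {
    "wheel1": 120.0, "chassis1": 1234.0, "bumper": 89.0,
    "engine1": 2343.0, "radio": 45.0,
}


def my_lookup_function(name):
    return UNIT_PRICES[name]


def part_price(_input_consumed, results_provided):
    count = _input_consumed["part_count"]
    name = _input_consumed["part_name"]
    results_provided["part_price"] = count * my_lookup_function(name)


def car_price(_input_consumed, results_provided):
    results_provided["car_price"] = sum(_input_consumed["prices"])


parts = [
    {"count": 4, "name": "wheel1"},
    {"count": 1, "name": "chassis1"},
    {"count": 2, "name": "bumper"},
    {"count": 1, "name": "engine1"},
    {"count": 1, "name": "radio"},
]


def run_part_worker(part):
    """Hypothetical helper: one worker per value of IDX_PART."""
    provided = {}
    part_price({"part_count": part["count"], "part_name": part["name"]}, provided)
    return provided["part_price"]


# Stage 1: one worker per part; map() returns results in part order
with ThreadPoolExecutor() as pool:
    prices = list(pool.map(run_part_worker, parts))

# Stage 2: the car price worker runs once all part prices are available
provided = {}
car_price({"prices": prices}, provided)
print(prices, provided)
```

The sketch makes the trade-off visible: five tiny tasks are dispatched for one car, so when each lookup is cheap the dispatch overhead can dominate the useful work.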
Car model 3
In model3.yml all price lookups take place in one single component and the car price calculation takes place in another component. For the lookup component, the relevant sections of the model file could look like this
# price for all parts
consumes_input:
  - name: parts_name
    path: cars[IDX_CAR].parts[:@IDX_PART].name
  - name: parts_count
    path: cars[IDX_CAR].parts[:@IDX_PART].count
provides_results:
  - name: parts_price
    path: cars[IDX_CAR].parts[:@IDX_PART].price
Notice that the component consumes all part indices (:@IDX_PART) for a specific car index (IDX_CAR). This allows the component to store results data on all part indices for a specific car index. The entrypoint for the first component (price for all parts) could look something like this
def part_price(_input_consumed, results_provided):
    counts = _input_consumed['parts_count']
    names = _input_consumed['parts_name']
    results_provided['parts_price'] = [
        count * my_lookup_function(name)
        for count, name in zip(counts, names)
    ]
In this model, the car price component is identical to the one used in model 2 and is therefore omitted here.
Path to the entrypoint function
To tie together the bindings with the Python code that does the actual work you need to add the path of the Python source code file to the model file. For the first car model it could look like this.
- path: ./components/price1.py
  func_name: price
  provides_results:
    - name: car_price
      path: cars[IDX_CAR].price
  consumes_input:
    - name: part_names
      path: cars[IDX_CAR].parts[:@IDX_PART].name
    - name: part_counts
      path: cars[IDX_CAR].parts[:@IDX_PART].count
The specified path should be relative to the model's base_path attribute, which defaults to the location of the model file when the model is initialized using the from_file method. You can also use a dotted path e.g.
path: hubit_components.price1
is_dotted_path: True
where hubit_components would typically be a package you have installed in site-packages.
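How Hubit resolves a dotted path internally is not covered in this example. As a rough illustration of what a dotted path combined with func_name means in Python terms, the sketch below uses the standard library's importlib; load_entrypoint is a hypothetical helper, not part of Hubit.

```python
import importlib


def load_entrypoint(dotted_path, func_name):
    """Hypothetical helper: import a module by dotted path and fetch a function."""
    module = importlib.import_module(dotted_path)
    return getattr(module, func_name)


# Demonstrated with a standard-library module so the snippet runs anywhere.
# For the model file above the arguments would correspond to
# "hubit_components.price1" and "price".
func = load_entrypoint("math", "sqrt")
print(func(16.0))
```

A dotted path only works if the package is importable from the environment running the model, which is why installing the components as a package is the typical setup.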
Running
To get results from a model you must submit a query. After Hubit has processed the query, the values of the queried attributes are returned in the response. A query may spawn many component workers that may each represent an instance of the same or different model components. Below are two examples of queries and the corresponding responses.
import os
import yaml

# HubitModel is provided by the hubit package.
# THISPATH is the directory that contains input.yml.

# Load model from file
hmodel = HubitModel.from_file('model1.yml', name='car')

# Load the input
with open(os.path.join(THISPATH, "input.yml"), "r") as stream:
    input_data = yaml.load(stream, Loader=yaml.FullLoader)

# Set the input on the model object
hmodel.set_input(input_data)

# Query the model
query = ['cars[0].price']
response = hmodel.get(query)
The response looks like this
{'cars[0].price': 4280.0}
In this case the parts prices will also be calculated by Hubit to create the response. A query for parts prices for all cars looks like this
query = ['cars[:].parts[:].price']
response = hmodel.get(query)
and the corresponding response is
{
'cars[:].parts[:].price':
[
[480.0, 1234.0, 178.0, 2343.0, 45.0],
[312.0, 1120.0, 178.0, 3400.0]
]
}
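The response is an ordinary dictionary keyed by the query string, so it can be post-processed with plain Python. The sketch below copies the response shown above and sums the parts prices per car.

```python
# Response copied from the example above
response = {
    "cars[:].parts[:].price": [
        [480.0, 1234.0, 178.0, 2343.0, 45.0],
        [312.0, 1120.0, 178.0, 3400.0],
    ]
}

# The outer list follows the car index, the inner lists the part index
parts_prices = response["cars[:].parts[:].price"]
car_prices = [sum(prices) for prices in parts_prices]
print(car_prices)
```

The first sum reproduces the value returned for the query cars[0].price, which shows that the two queries are consistent with each other.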
Rendering
If Graphviz is installed, Hubit can render models and queries. In the example below we have rendered the query [cars[0].price], i.e. the price of the car at index 0.
The graph illustrates nodes in the input data structure, nodes in the results data structure, the calculation components involved in creating the response as well as hints at which attributes flow in and out of these components. The triple bar icon ≡ indicates that the node is accessed by index and should therefore be a list. The graph was created using the command below.
query = ['cars[0].price']
hmodel.render(query)
Validation
Running
hmodel.validate()
will validate various aspects of the model.
Running
hmodel.validate(['cars[0].price'])
will validate various aspects of the query.
Caching
Model-level caching
By default Hubit never caches results internally. A Hubit model can, however, write results to disk automatically by using the set_model_caching method to set the caching level. Results caching is useful when you want to avoid spending time calculating the same results multiple times or to have Hubit create restart snapshots. The table below comes from printing the log after running model 2 with and without model-level caching
print(hmodel.log())
--------------------------------------------------------------------------------------------
Query finish time      Query took (s)   Worker name   Workers spawned   Component cache hits
--------------------------------------------------------------------------------------------
21-Mar-2021 20:46:31   0.1              car_price     0                 0
                                        part_price    0                 0
21-Mar-2021 20:46:31   1.8              car_price     3                 0
                                        part_price    14                0
--------------------------------------------------------------------------------------------
The second run (top) using the cache is much faster than the first run (bottom) that spawns 17 workers to complete the query.
The model cache can be cleared using the clear_cache method on a Hubit model. To check if a model has an associated cached result, use the has_cached_results method on a Hubit model. Cached results for all models can be cleared by using hubit.clear_hubit_cache().
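The cache format and location used by Hubit are not described in this example. The sketch below only illustrates the general idea of caching results on disk keyed by the input, using the standard library; get_with_cache, cache_file and slow_model are hypothetical names and not part of Hubit.

```python
import hashlib
import json
import os
import tempfile

# Hypothetical cache location (assumption): a fresh temporary directory
CACHE_DIR = tempfile.mkdtemp()


def cache_file(input_data):
    """Derive a file name from a hash of the input data."""
    serialized = json.dumps(input_data, sort_keys=True).encode()
    key = hashlib.sha256(serialized).hexdigest()
    return os.path.join(CACHE_DIR, f"{key}.json")


def get_with_cache(input_data, compute):
    """Return (results, cache_hit); compute only when no cached file exists."""
    path = cache_file(input_data)
    if os.path.exists(path):
        with open(path) as stream:
            return json.load(stream), True
    results = compute(input_data)
    with open(path, "w") as stream:
        json.dump(results, stream)
    return results, False


def slow_model(input_data):
    # Stand-in for an expensive calculation
    return {"total": sum(input_data["values"])}


first, first_hit = get_with_cache({"values": [1, 2, 3]}, slow_model)
second, second_hit = get_with_cache({"values": [1, 2, 3]}, slow_model)
print(first_hit, second_hit)
```

The second call reads the stored file instead of recomputing, which mirrors the log above where the cached run spawns no workers.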
Component-level caching
Component-level caching can be activated using set_component_caching. By default component-level caching is off. If component-level caching is on, the consumed data for all spawned component workers and the corresponding results will be stored in memory during execution of a query. If Hubit finds that, in the same query, two workers refer to the same model component and the input data are identical, the second worker will simply use the results produced by the first worker. The cache is not shared between sequential queries to a model. Also, the component-level cache is not shared between the individual sampling runs using the get_many method.
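The reuse rule described above (same component, identical consumed data) amounts to memoization within one query. The sketch below imitates it for the part price component of model 2 using the two-car input from this example; the cache key layout, the counters and the unit prices are assumptions made for illustration and do not reflect Hubit's internals.

```python
import json

# Hypothetical unit prices (assumption), consistent with the "Running" section
UNIT_PRICES = {
    "wheel1": 120.0, "chassis1": 1234.0, "bumper": 89.0,
    "engine1": 2343.0, "radio": 45.0,
    "wheel2": 78.0, "chassis2": 1120.0, "engine14": 3400.0,
}


def part_price(_input_consumed, results_provided):
    count = _input_consumed["part_count"]
    name = _input_consumed["part_name"]
    results_provided["part_price"] = count * UNIT_PRICES[name]


cars = [
    [(4, "wheel1"), (1, "chassis1"), (2, "bumper"), (1, "engine1"), (1, "radio")],
    [(4, "wheel2"), (1, "chassis2"), (2, "bumper"), (1, "engine14")],
]

cache = {}  # lives for the duration of one query only
spawned = 0
hits = 0
prices = []
for parts in cars:
    prices_for_car = []
    for count, name in parts:
        consumed = {"part_count": count, "part_name": name}
        spawned += 1
        # Key: component name plus a stable serialization of the consumed data
        key = ("part_price", json.dumps(consumed, sort_keys=True))
        if key in cache:
            hits += 1  # reuse the results of an earlier, identical worker
        else:
            provided = {}
            part_price(consumed, provided)
            cache[key] = provided
        prices_for_car.append(cache[key]["part_price"])
    prices.append(prices_for_car)

print(spawned, hits)
```

With this input only the bumper (count 2) occurs in both cars, so a single worker out of nine reuses an earlier result; inputs with more repeated parts benefit correspondingly more.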
The table below comes from printing the log after running car model 2 with and without component-level caching
print(hmodel.log())
--------------------------------------------------------------------------------------------
Query finish time      Query took (s)   Worker name   Workers spawned   Component cache hits
--------------------------------------------------------------------------------------------
21-Mar-2021 20:48:26   1.1              car_price     3                 1
                                        part_price    14                6
21-Mar-2021 20:48:25   1.8              car_price     3                 0
                                        part_price    14                0
--------------------------------------------------------------------------------------------
The second run (top) uses component-caching and is faster than the first run (bottom). Both queries spawn 17 workers in order to complete the query, but in the case where component-caching is active (top) 7 workers reuse results provided by the remaining 10 workers.
For smaller jobs any speed-up obtained by using component-level caching cannot be seen on the wall clock when using multi-processing. The effect will, however, be apparent in the model log as seen above.