DataWeave 2.2 and Apache Avro


1. Overview

Mule Runtime 4.2 ships with DataWeave 2.2, which adds support for content serialization and deserialization with Apache Avro. In this post, we will take DataWeave 2.2's Avro support for a test drive.

Requirements:

  • Mule Runtime 4.2.0 and above

    • DataWeave 2.2 (uses Apache Avro 1.9.0)

2. Apache Avro

Apache Avro™ is a data serialization system. Some of the features Avro provides are:

  • Rich data structures.

  • A compact, fast, binary data format.

  • A container file, to store persistent data.

2.1 Avro Schemas

Data formats in Avro are described using schemas, which are defined in JSON. A schema is required when serializing data to Avro, and the schema content is also embedded in the serialized output. This makes deserialization easy, as the required schema travels with the data itself.

employee.avro.json: Example Employee Schema
{
	"namespace": "com.javastreets.avro",
	"name": "Employee",
	"type": "record",
	"fields": [
		{
			"name": "employeeId",
			"type": "int"
		},
		{
			"name": "firstname",
			"type": "string"
		},
		{
			"name": "lastname",
			"type": "string"
		},
		{
			"name": "address",
			"type": "string"
		},
		{
			"name": "notes",
			"type": "string"
		}
	]
}
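
Because an Avro schema is plain JSON, it can be inspected with ordinary JSON tooling. The sketch below uses only the Python standard library (the Employee schema is inlined so the snippet is self-contained) to list the record's field names and types:

```python
import json

# The Employee schema from employee.avro.json, inlined for a
# self-contained example.
schema_json = """
{
  "namespace": "com.javastreets.avro",
  "name": "Employee",
  "type": "record",
  "fields": [
    {"name": "employeeId", "type": "int"},
    {"name": "firstname", "type": "string"},
    {"name": "lastname", "type": "string"},
    {"name": "address", "type": "string"},
    {"name": "notes", "type": "string"}
  ]
}
"""

schema = json.loads(schema_json)

# An Avro record schema is identified by "type": "record" and carries
# an ordered list of fields, each with a name and a type.
assert schema["type"] == "record"
fields = {f["name"]: f["type"] for f in schema["fields"]}
print(fields)
# {'employeeId': 'int', 'firstname': 'string', 'lastname': 'string',
#  'address': 'string', 'notes': 'string'}
```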

3. DataWeave 2.2 Avro Support

DataWeave 2.2 adds support for serializing and deserializing data in the Avro format. For DataWeave to use Avro (de)serialization, the MIME type of the data must be application/avro.

3.1 Serialize (output) with Avro

To serialize (output) data in the Avro format, the output directive must be set to application/avro. A schema is required for serialization; we use the schemaUrl attribute on the output directive to reference our schema file.

To serialize a list of Employees using the above schema, our DataWeave script would look like the following:

DataWeave Avro Output
%dw 2.2
output application/avro schemaUrl="employee.avro.json" (1)
---
(0 to 100) map {
	employeeId: $$,
	firstname: "Manik" ++ $$,
	lastname: "Magar",
	address: "Test dummy address 123",
	notes: "some more information"
}
(1) References employee.avro.json in the src/main/resources directory.
The body of the script does NOT contain any Avro-specific code, so the script development experience is exactly the same as for other formats.

For larger payloads, streaming can be enabled by adding the deferred=true attribute (default false) and, optionally, the bufferSize attribute (default 8192) to the output directive. For example, the following output directive enables streaming with the given buffer size:

output application/avro schemaUrl="employee.avro.json",deferred=true,bufferSize=8192

The serialized data from the above DataWeave script looks like the following (note that it starts with the actual schema definition):

Serialization output
Objavro.schema�{"type":"record","name":"Employee","namespace":"com.javastreets.avro","fields":[{"name":"employeeId","type":"int"},{"name":"firstname","type":"string"},{"name":"lastname","type":"string"},{"name":"address","type":"string"},{"name":"notes","type":"string"}]}�f�m���(�%�g���_Manik0
Magar,Test dummy address 123*some more informationManik1
Magar,Test dummy address 123*some more informationManik2
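
The leading Obj text in the output above is the magic of an Avro object container file: per the Avro specification, such a file starts with the four bytes Obj plus 0x01 (format version 1), followed by header metadata that embeds the writer's schema under the avro.schema key. A minimal sketch in Python (standard library only, operating on raw bytes rather than using a real Avro library; the sample bytes are hypothetical) that recognizes this header:

```python
# Sketch: detect the Avro object container file magic and confirm the
# embedded schema is present. This inspects raw bytes only; a real
# reader would use an Avro library to parse the header metadata map.

MAGIC = b"Obj\x01"  # "Obj" + format version 1, per the Avro spec

def looks_like_avro_container(data: bytes) -> bool:
    """True if the data starts with the Avro container file magic."""
    return data.startswith(MAGIC)

# Simulated start of a container file: magic, then header metadata that
# carries the writer's schema under the "avro.schema" key (the real
# metadata is a binary-encoded map; this is a simplified stand-in).
sample = MAGIC + b"avro.schema" + b'{"type":"record","name":"Employee"}'

print(looks_like_avro_container(sample))  # True
print(b"avro.schema" in sample)           # True: the schema travels with the data
```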

3.2 Deserialize with Avro

Deserialization with Avro in DataWeave 2.2 is straightforward. If the payload is data serialized with Avro, its MIME type MUST be application/avro. Once that is set, DataWeave can deserialize the content and convert it to any other format.

Consider the following flow, which reads a file originally serialized with Avro.

Avro File reader flow
<flow name="test-avro-supportFlow1">
	<file:listener doc:name="On New or Updated File" directory="Documents/mule-avro/input" moveToDirectory="Documents/mule-avro/backup" autoDelete="true" outputMimeType="application/avro"> (1)
		<scheduling-strategy >
			<fixed-frequency />
		</scheduling-strategy>
		<file:matcher filenamePattern="*.avro"/> (2)
	</file:listener>
	<ee:transform doc:name="Transform Message">
		<ee:message >
			<ee:set-payload ><![CDATA[%dw 2.2
output application/json 	(3)
---
payload]]></ee:set-payload>
		</ee:message>
	</ee:transform>
	<file:write doc:name="Write" path="#['Documents/mule-avro/output/test.avro' ++ attributes.creationTime ++'.json']" /> (4)
</flow>
(1) Listens for new files and sets the MIME type to application/avro.
(2) Matches *.avro files.
(3) Converts the file content (e.g. the original Employee list) to JSON.
(4) Writes the JSON file to the output directory.
As per the Avro specification, the serialized data contains the schema definition, and the reader uses that inline definition to deserialize the data. Again, the body of the script does NOT contain any Avro-specific code, so the development experience is exactly the same as for other formats.

This simple transformation should generate a JSON file like the one below:

Avro to JSON output
[
  {
    "employeeId": 0,
    "firstname": "Manik0",
    "lastname": "Magar",
    "address": "Test dummy address 123",
    "notes": "some more information"
  },
  {
    "employeeId": 1,
    "firstname": "Manik1",
    "lastname": "Magar",
    "address": "Test dummy address 123",
    "notes": "some more information"
  }
]

4. Conclusion

This post demonstrated how Apache Avro can be used with DataWeave 2.2. We looked at both writing (serializing) and reading (deserializing) data with Apache Avro. The demo source code is available on GitHub.
