Deserialization, normalization, validation and the JMS Serializer
Serializing is not always trivial, and de-serializing is harder still: the source data (XML or JSON) can be badly formatted, well formatted but not valid, or syntactically valid but not valid in our domain. Keeping deserialization, normalization and validation separate is fundamental.
Recently, while speaking at the Symfony Berlin User Group, during the Q&A session, I was asked to comment on how to solve an issue when using FOS REST Bundle and JMS Serializer.
The JMS Serializer is a library that serializes an object graph into a JSON or XML representation and deserializes it back. For the last two years I have been doing my best as its maintainer.
Serializing can be a non-trivial process, but when it comes to de-serialization the problem is even more difficult. The source data (XML or JSON) can be badly formatted, well formatted but not valid, or syntactically valid but not valid in our domain.
An example of this situation can be:
// blog post article
{
    "author_id": 60,
    "price": "123.540,50",
    "date": "2010-02-05 10:20:60+02:00",
    "text": "hello world"
}
Here we can ask ourselves:
- is author_id = 60 a valid author id?
  - maybe it is "AUTHOR60" instead of 60
  - maybe it is 59 instead of 60
  - maybe it is ...
- what is the value of price?
- is the date "5th of February" or "2nd of May"?
- is text = "hello world" a valid text?
  - maybe it is too short?
  - maybe it is too long?
  - maybe it should be a valid HTML document?
These and many other questions may arise from a few lines of JSON.
What we are talking about here is not de-serialization, but normalization and validation.
Normalization
Normalization is the process of translating data from one representation to a different one (possibly more convenient for some reason).
As an example, 12.4872,50, 12 4872.50, 124872,50 and 124872.50 may represent the same number, just expressed using different localization standards.
Normalizing the numbers above means converting all of them into a single unified format that allows us to work with them. It can be a float number, a Money value object, an integer (multiplied by 100 in this case), or any other representation that fits our domain model well.
Note: normalization is about converting/translating data.
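To make the idea concrete, here is a minimal sketch that normalizes the four representations above into an integer number of cents. The function name normalizeAmountToCents and the rule it applies (the last "." or "," is the decimal separator, everything else is grouping) are assumptions made for this illustration; a real application would pick the rule that matches its own inputs.

```php
<?php
/**
 * Normalizes a localized amount such as "124872,50" or "12 4872.50"
 * into an integer number of cents.
 *
 * Assumption (illustration only): the LAST "." or "," is the decimal
 * separator; any other ".", "," or whitespace is a grouping character.
 */
function normalizeAmountToCents(string $raw): int
{
    $raw = trim($raw);

    // Position of the last "." or "," (or -1 when there is none).
    $lastDot   = strrpos($raw, '.');
    $lastComma = strrpos($raw, ',');
    $separator = max(
        $lastDot === false ? -1 : $lastDot,
        $lastComma === false ? -1 : $lastComma
    );

    $integerPart  = $separator === -1 ? $raw : substr($raw, 0, $separator);
    $fractionPart = $separator === -1 ? '' : substr($raw, $separator + 1);

    // Drop grouping characters from the integer part.
    $integerDigits = (string) preg_replace('/[.,\s]/', '', $integerPart);

    // At most two decimal digits are accepted, so nothing is silently rounded.
    $fractionOk = $fractionPart === ''
        || (ctype_digit($fractionPart) && strlen($fractionPart) <= 2);

    if (!ctype_digit($integerDigits) || !$fractionOk) {
        throw new InvalidArgumentException("Cannot normalize amount: {$raw}");
    }

    // "5" becomes "50" cents, "" becomes "00".
    return (int) $integerDigits * 100 + (int) str_pad($fractionPart, 2, '0');
}
```

With this rule all four strings end up as the same integer, which is the whole point of normalizing: the rest of the application only ever sees one representation.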
Validation
Validation is the process of asserting whether a given piece of information is valid in a given context.
We can validate data either before or after the normalization phase.
Post Normalization validation
As an example, take 124872.50 normalized as a float.
If interpreted as meters, this value might be valid if we are talking about distances, but it is most probably not valid if we are talking about human height.
Another example is the text hello world normalized as a string: it can be valid as an "english language string", but it is most probably not valid as a "blog post", where we might require at least 300 words.
In the post-normalization phase, the validity of the data depends on the application domain/context.
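A small sketch of the two examples above: the same normalized value passes one check and fails another, depending on the context. The function names and the concrete limits (3 meters, 300 words) are hypothetical and only serve the illustration.

```php
<?php
// Hypothetical rules; every domain defines its own limits.

function isValidDistanceInMeters(float $meters): bool
{
    return $meters >= 0.0;
}

function isValidHumanHeightInMeters(float $meters): bool
{
    // Assumed upper bound, chosen only for this example.
    return $meters > 0.0 && $meters < 3.0;
}

function isValidBlogPostText(string $text, int $minWords = 300): bool
{
    return str_word_count($text) >= $minWords;
}

$value = 124872.50;

var_dump(isValidDistanceInMeters($value));     // a plausible distance
var_dump(isValidHumanHeightInMeters($value));  // not a plausible height
var_dump(isValidBlogPostText('hello world'));  // far fewer than 300 words
```

None of these functions changes the value it receives; they only answer yes or no for one specific context.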
Pre Normalization validation
As an example, take the 12.4872.50 information.
If we want to normalize it to a float, it is really difficult to decide which number it should represent. Is it 124872.50 (float), or 12.487250 (float), or is it just invalid?
If we had 12 4872.50, representing it as a float is a bit easier, but again: is it 12.0 (float), or 124872.50 (float), or is it just invalid?
In the case of pre-normalization, the validation process overlaps with what we are able to normalize in a reliable way (by reliability I mean the level of error I want to allow in my application).
- In applications where the data is produced by other software, it is good practice to have really strict validation and to reduce normalization to the minimum possible, by being as explicit as possible about which data representations are allowed.
- In applications where the data is produced by humans, the situation is more complex and depends on how much frustration we want to inflict on the user when they provide the data. As an example, in an e-commerce shop it can be a good idea to "help" the user a bit by normalizing more than we would in an application for filling in a tax declaration form.
Note: validation does not perform any data conversion.
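For the first case (data produced by other software), a strict pre-normalization check can be as small as the sketch below. The accepted format (digits, a dot, exactly two decimals) and the name isStrictAmount are assumptions made for this example, not a recommendation for any particular API.

```php
<?php
/**
 * Strict pre-normalization validation for machine-produced amounts.
 * Assumed format: one or more digits, a ".", exactly two decimals.
 * The function only answers yes/no; it performs no conversion.
 */
function isStrictAmount(string $raw): bool
{
    // \A and \z anchor the whole string, so a trailing newline is rejected too.
    return preg_match('/\A\d+\.\d{2}\z/', $raw) === 1;
}

var_dump(isStrictAmount('124872.50'));   // matches the agreed format
var_dump(isStrictAmount('12.4872.50'));  // ambiguous, rejected
var_dump(isStrictAmount('12 4872.50'));  // ambiguous, rejected
```

Rejecting the ambiguous inputs up front means the normalization step that follows has exactly one format left to handle.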
De-Serialization
De-serialization is the process of translating a representation that can be stored or transmitted into an object state.
Deserialization does not explicitly perform validation or normalization, but in order to create the object state the "deserialization engine" may implicitly do some "pre-normalization validation" and normalization.
The deserializer will not do "post-normalization validation"; this means that your object graph might be in an invalid state after the deserialization process.
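The sketch below shows this with a deliberately naive, hand-written "deserialization engine" in plain PHP. It is not the JMS Serializer and does not use its API; the BlogPost class and the deserializeBlogPost function are hypothetical and exist only to show where the implicit steps happen and what is left unchecked.

```php
<?php
// Hypothetical class standing in for the blog post from the JSON example.
final class BlogPost
{
    public int $authorId = 0;
    public string $text = '';
}

/**
 * A naive "deserialization engine": it only maps JSON keys to typed
 * properties and knows nothing about the application domain.
 */
function deserializeBlogPost(string $json): BlogPost
{
    // Implicit pre-normalization validation: malformed JSON is rejected
    // with a JsonException.
    $data = json_decode($json, true, 512, JSON_THROW_ON_ERROR);

    $post = new BlogPost();
    // Implicit normalization: the casts turn "60" and 60 into the same int.
    $post->authorId = (int) ($data['author_id'] ?? 0);
    $post->text     = (string) ($data['text'] ?? '');

    return $post;
}

$post = deserializeBlogPost('{"author_id": "60", "text": "hello world"}');
var_dump($post->authorId, $post->text);
```

The resulting object is well typed, yet nothing has checked whether author 60 exists or whether "hello world" is long enough to be a blog post: that is post-normalization validation, and it has to happen somewhere else.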
JMS Serializer
This post is mainly about PHP, and more specifically about the JMS Serializer and its deserialization capabilities.
In the past, different users have tried to extend the JMS Serializer with the ability to validate objects, but mixing deserialization, validation and normalization is a risky idea. It is risky because we can end up with a library that does none of the three well. Currently there are great alternatives for both validation and normalization.
Post-normalization validation
If you want to ensure a valid object state at the end of the deserialization process, a great solution is the Symfony Validator:
// deserialize first, then validate the resulting object
$myObject = $serializer->deserialize('some json data here', 'MyObject', 'json');
// validate() returns a countable list of constraint violations
$errors = $validator->validate($myObject);
if (count($errors) > 0) {
    // do something
} else {
    // do something else :)
}
The valid object state can be described using XML, YAML and annotations when working with the Symfony Validator.
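As an illustration, the rules for a class could look like the sketch below. It assumes PHP 8 attributes, which recent Symfony Validator versions accept as an alternative to docblock annotations; the property names and the limits are hypothetical. Note that the Length constraint counts characters, not words, so a "300 words" rule would need a different or custom constraint.

```php
<?php
use Symfony\Component\Validator\Constraints as Assert;

// Hypothetical shape for MyObject; names and limits are illustrative only.
final class MyObject
{
    // The id must be present and greater than zero.
    #[Assert\NotNull]
    #[Assert\Positive]
    public ?int $authorId = null;

    // The text must not be empty and needs at least 300 characters.
    #[Assert\NotBlank]
    #[Assert\Length(min: 300)]
    public ?string $text = null;
}
```

These attributes are only metadata: they do nothing during deserialization and are evaluated when the object is passed to the validator.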
Pre-normalization validation and normalization
At the moment there is no solution out there for doing proper normalization and pre-normalization validation of the data with the JMS Serializer, but luckily there is another great option: the Symfony Form component.
The Symfony Form component gives you much more granular control over the process of "deserializing" some data into an object graph.
// just setup a fresh MyObject object (remove the dummy data)
$form = $this->createFormBuilder(new MyObject())
    ->add('task', TextType::class)
    ->add('dueDate', DateType::class)
    ->getForm();
$form->submit(json_decode('some json data here', true));
if ($form->isValid()) {
    $myObject = $form->getData();
    // do something
} else {
    // do something else :)
}
The data types DateType and TextType (and many others built into Symfony) have plenty of useful options and configurations that allow you to customize the de-serialization process. The form component can also set all the fields to NULL before starting the "conversion" process, or it can just do a simple update of your object state (something that is not possible with the JMS Serializer).
Compared to the JMS Serializer, the Symfony Form deserialization process is much more complex and is not trivial to configure, but it will do pre-normalization validation and normalization for you.
Conclusion
With this post I'm not saying "stop using the JMS Serializer" for deserialization; what I'm saying is "use the JMS de-serializer in specific situations" where validation and normalization are not fundamental.
My personal use-case for the JMS de-serializer is exchanging data between "Queue-workers". The data is produced by a PHP application (using the JMS Serializer) and consumed by various PHP workers (which deserialize it using the JMS de-serializer). In my case the data is really simple and produced by reliable sources. Because of that, validation and normalization are almost unnecessary. (Some post-normalization validation is performed at application level to ensure the data received is valid, but not much.)
My personal NOT-use-case for the JMS de-serializer is the handling of POST/PUT/PATCH requests in a REST webservice. The data that a public REST webservice has to handle is often complex, especially if it is part of an API consumed by some user-facing frontend application. The API has to validate and normalize user input such as emails, dates of birth, bank transaction amounts and so on...
What do you think? Is there a library out there that properly handles all three parts?
Looking forward to hearing your feedback!