Strings
Wide and Wide-Wide Strings
We've seen many source-code examples so far that include strings. In most of them, we used the standard string type: String. This type is useful for the common use case of displaying messages or dealing with information in plain English. Here, we define "plain English" as the use of the language that avoids French accents or German umlauts, for example, and doesn't make use of any characters from non-Latin alphabets.
There are two additional string types in Ada: Wide_String and Wide_Wide_String. These types are particularly important when dealing with textual information in non-standard English, or in various other languages, non-Latin alphabets and special symbols.
These string types use different bit widths for their characters. This becomes more apparent when looking at the type definitions:
type String is array(Positive range <>) of Character;
type Wide_String is array(Positive range <>) of Wide_Character;
type Wide_Wide_String is array(Positive range <>) of Wide_Wide_Character;
The following table shows the typical bit-width of each character of the string types:
Character Type | Width |
---|---|
Character | 8 bits |
Wide_Character | 16 bits |
Wide_Wide_Character | 32 bits |
We can see that when running this example:
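A minimal sketch of such a program, assuming a typical implementation where the widths from the table apply (the procedure name Show_Wide_Char_Types is chosen here for illustration):

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Wide_Char_Types is
begin
   --  'Size reports the number of bits used by values of each type.
   --  The results are implementation-dependent.
   Put_Line ("Character'Size:           "
             & Integer'Image (Character'Size));
   Put_Line ("Wide_Character'Size:      "
             & Integer'Image (Wide_Character'Size));
   Put_Line ("Wide_Wide_Character'Size: "
             & Integer'Image (Wide_Wide_Character'Size));
end Show_Wide_Char_Types;
```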
Let's look at another example, this time using wide strings:
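The following sketch declares the same five-character text with each string type and displays its length and size. The procedure name and the renaming names TI, WTI and WWTI are assumptions made for this illustration:

```ada
with Ada.Text_IO;
with Ada.Wide_Text_IO;
with Ada.Wide_Wide_Text_IO;

procedure Show_Wide_String_Types is
   --  Renamings that distinguish the three text I/O packages
   package TI   renames Ada.Text_IO;
   package WTI  renames Ada.Wide_Text_IO;
   package WWTI renames Ada.Wide_Wide_Text_IO;

   S   : constant String           := "hello";
   WS  : constant Wide_String      := "hello";
   WWS : constant Wide_Wide_String := "hello";
begin
   TI.Put_Line ("String:           " & S);
   TI.Put_Line ("Length:           " & Integer'Image (S'Length));
   TI.Put_Line ("Size:             " & Integer'Image (S'Size));
   TI.New_Line;

   WTI.Put_Line ("Wide string:      " & WS);
   WTI.Put_Line ("Length:           " & Integer'Wide_Image (WS'Length));
   WTI.Put_Line ("Size:             " & Integer'Wide_Image (WS'Size));
   WTI.New_Line;

   WWTI.Put_Line ("Wide-wide string: " & WWS);
   WWTI.Put_Line ("Length:           "
                  & Integer'Wide_Wide_Image (WWS'Length));
   WWTI.Put_Line ("Size:             "
                  & Integer'Wide_Wide_Image (WWS'Size));
end Show_Wide_String_Types;
```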
Here, all strings (S, WS and WWS) have the same length of 5 characters. However, the size of each character is different, so each string has a different overall size.
The recommendation is to use the String type when the textual information you're processing is in standard English. If any kind of internationalization is needed, Wide_Wide_String is probably the best choice, as its 32-bit characters can represent any character of the ISO/IEC 10646 (Unicode) character set.
Text I/O
Note that, in the previous example, we used different versions of the Ada.Text_IO package depending on the string type:
- Ada.Text_IO for objects of String type,
- Ada.Wide_Text_IO for objects of Wide_String type,
- Ada.Wide_Wide_Text_IO for objects of Wide_Wide_String type.
In that example, we were also using package renaming to differentiate among those packages.
Similarly, there are different versions of text I/O packages for individual types. For example, if we want to display the value of a Long_Integer variable based on the Wide_Wide_String type, we can select the Ada.Long_Integer_Wide_Wide_Text_IO package. In fact, the list of packages resulting from the combination of those types is quite long:
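Scalar Type | Text I/O Packages |
---|---|
Integer | Ada.Integer_Text_IO, Ada.Integer_Wide_Text_IO, Ada.Integer_Wide_Wide_Text_IO |
Long_Integer | Ada.Long_Integer_Text_IO, Ada.Long_Integer_Wide_Text_IO, Ada.Long_Integer_Wide_Wide_Text_IO |
Long_Long_Integer | Ada.Long_Long_Integer_Text_IO, Ada.Long_Long_Integer_Wide_Text_IO, Ada.Long_Long_Integer_Wide_Wide_Text_IO |
Float | Ada.Float_Text_IO, Ada.Float_Wide_Text_IO, Ada.Float_Wide_Wide_Text_IO |
Long_Float | Ada.Long_Float_Text_IO, Ada.Long_Float_Wide_Text_IO, Ada.Long_Float_Wide_Wide_Text_IO |
Long_Long_Float | Ada.Long_Long_Float_Text_IO, Ada.Long_Long_Float_Wide_Text_IO, Ada.Long_Long_Float_Wide_Wide_Text_IO |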
Also, there are different versions of the generic packages Integer_IO and Float_IO:
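Scalar Type | Text I/O Packages |
---|---|
Integer types | Ada.Text_IO.Integer_IO, Ada.Wide_Text_IO.Integer_IO, Ada.Wide_Wide_Text_IO.Integer_IO |
Real types | Ada.Text_IO.Float_IO, Ada.Wide_Text_IO.Float_IO, Ada.Wide_Wide_Text_IO.Float_IO |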
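As an illustration of the generic route, the sketch below instantiates the Integer_IO generic nested in Ada.Wide_Wide_Text_IO for Long_Integer. The procedure, instance and variable names are hypothetical, and the sketch assumes the implementation provides Long_Integer:

```ada
with Ada.Wide_Wide_Text_IO;

procedure Show_Wide_Wide_Integer_IO is
   package WWTI renames Ada.Wide_Wide_Text_IO;

   --  Instance of the generic Integer_IO for wide-wide text output
   package Long_Integer_IO is new WWTI.Integer_IO (Long_Integer);

   V : constant Long_Integer := 42;
begin
   WWTI.Put ("Value: ");
   --  Width => 0 requests the minimum field width (no padding)
   Long_Integer_IO.Put (V, Width => 0);
   WWTI.New_Line;
end Show_Wide_Wide_Integer_IO;
```

Instantiating the generic is the more portable option when a non-generic package for a given type might not be available.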
Wide and Wide-Wide String Handling
As we've just seen, we have different versions of the Ada.Text_IO package. The same applies to string handling packages. As we've seen in the Introduction to Ada course, we can use the Ada.Strings.Fixed and Ada.Strings.Maps packages for string handling. For the wider string types, we have these packages:
- Ada.Strings.Wide_Fixed,
- Ada.Strings.Wide_Wide_Fixed,
- Ada.Strings.Wide_Maps,
- Ada.Strings.Wide_Wide_Maps.
Let's look at this example from the Introduction to Ada course, which we adapted for wide-wide strings:
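A sketch of how such an adaptation might look, assuming a phrase with single spaces between words (the text stored in S is chosen here for illustration):

```ada
with Ada.Strings;                 use Ada.Strings;
with Ada.Strings.Wide_Wide_Fixed; use Ada.Strings.Wide_Wide_Fixed;
with Ada.Strings.Wide_Wide_Maps;  use Ada.Strings.Wide_Wide_Maps;
with Ada.Wide_Wide_Text_IO;       use Ada.Wide_Wide_Text_IO;

procedure Show_Find_Words is
   S : constant Wide_Wide_String := "Hello World World World";
   F : Positive;      --  First index of the token found
   L : Natural;       --  Last index of the token found (0 if none)
   I : Natural := 1;  --  Index where the next search starts

   Whitespace : constant Wide_Wide_Character_Set := To_Set (' ');
begin
   Put_Line ("String: " & S);
   Put_Line ("String length: " & Integer'Wide_Wide_Image (S'Length));

   while I in S'Range loop
      --  Look for the next run of characters outside the Whitespace set
      Find_Token
        (Source => S,
         Set    => Whitespace,
         From   => I,
         Test   => Outside,
         First  => F,
         Last   => L);

      exit when L = 0;

      Put_Line ("Found word instance at position "
                & Natural'Wide_Wide_Image (F)
                & ": '" & S (F .. L) & "'");
      I := L + 1;
   end loop;
end Show_Find_Words;
```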
In this example, we use the Find_Token procedure to find the words in the phrase stored in the S constant. All the operations we use here are similar to the ones for the String type, but they operate on the Wide_Wide_String type instead.
Bounded and Unbounded Wide and Wide-Wide Strings
We've seen in the Introduction to Ada course that other kinds of String types are available. For example, we can use bounded and unbounded strings, which correspond to the Bounded_String and Unbounded_String types.
Those kinds of string types are also available for Wide_String and Wide_Wide_String. The following table shows the available types and corresponding packages:
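Type | Package |
---|---|
Bounded_Wide_String | Ada.Strings.Wide_Bounded |
Bounded_Wide_Wide_String | Ada.Strings.Wide_Wide_Bounded |
Unbounded_Wide_String | Ada.Strings.Wide_Unbounded |
Unbounded_Wide_Wide_String | Ada.Strings.Wide_Wide_Unbounded |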
The same applies to text I/O for those strings. For the standard case, we have Ada.Text_IO.Bounded_IO for the Bounded_String type and Ada.Text_IO.Unbounded_IO for the Unbounded_String type. For wider string types, we have:
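Type | Text I/O Package |
---|---|
Bounded_Wide_String | Ada.Wide_Text_IO.Wide_Bounded_IO |
Bounded_Wide_Wide_String | Ada.Wide_Wide_Text_IO.Wide_Wide_Bounded_IO |
Unbounded_Wide_String | Ada.Wide_Text_IO.Wide_Unbounded_IO |
Unbounded_Wide_Wide_String | Ada.Wide_Wide_Text_IO.Wide_Wide_Unbounded_IO |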
Let's look at a simple example:
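A minimal sketch, with a procedure name chosen for illustration:

```ada
with Ada.Strings.Wide_Wide_Unbounded;
use  Ada.Strings.Wide_Wide_Unbounded;

with Ada.Wide_Wide_Text_IO.Wide_Wide_Unbounded_IO;
use  Ada.Wide_Wide_Text_IO.Wide_Wide_Unbounded_IO;

procedure Show_Unbounded_Wide_Wide_String is
   S : Unbounded_Wide_Wide_String :=
         To_Unbounded_Wide_Wide_String ("Hello");
begin
   --  Concatenate an unbounded string with a Wide_Wide_String literal
   S := S & " hello";

   --  Put_Line from Wide_Wide_Unbounded_IO accepts the unbounded
   --  string directly, without converting it first
   Put_Line ("Unbounded wide-wide string: " & S);
end Show_Unbounded_Wide_Wide_String;
```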
In this example, we declare a variable S and initialize it with the word "Hello". Then, we concatenate it with " hello" and display it. All the operations we use here are similar to the ones for the Unbounded_String type, but they've been adapted for the Unbounded_Wide_Wide_String type.
String Encoding
Unicode is the most widespread standard for encoding text in writing systems beyond the basic Latin alphabet. It defines several Unicode Transformation Formats (UTF), which differ in the width of their code units, their backward compatibility with older encodings, and other requirements.
UTF-8 encoding and decoding
A common UTF format is UTF-8, which encodes each symbol using one to four 8-bit bytes and is backwards-compatible with the ASCII format. While ASCII characters require only one byte each, most Chinese characters require three bytes, for example.
In Ada applications, UTF-8 strings are indicated by using the UTF_8_String subtype from the Ada.Strings.UTF_Encoding package.
To convert to and from UTF-8 strings, we can use the Encode and Decode functions. Those functions are specified in the child packages of the Ada.Strings.UTF_Encoding package. We select the appropriate child package depending on the string type we're using, as you can see in the following table:
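Child Package of Ada.Strings.UTF_Encoding | Convert from / to |
---|---|
Strings | String type |
Wide_Strings | Wide_String type |
Wide_Wide_Strings | Wide_Wide_String type |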
Let's look at an example:
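The sketch below follows the structure described in this section. It assumes that the source file is stored in UTF-8 and that the compiler passes the bytes of the literal through unchanged. The Arabic phrase and the procedure name are assumptions made for this illustration:

```ada
with Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
with Ada.Strings.Wide_Wide_Unbounded;
use  Ada.Strings.Wide_Wide_Unbounded;

procedure Show_UTF_8 is
   package TIO renames Ada.Text_IO;

   --  Assumption: this literal is read as a sequence of UTF-8 bytes
   Hello_World_Arabic : constant UTF_8_String := "مرحبا يا عالم";

   WWS_Hello_World_Arabic : constant Wide_Wide_String :=
     Decode (Hello_World_Arabic);

   UWWS : Unbounded_Wide_Wide_String;
begin
   UWWS := To_Unbounded_Wide_Wide_String ("Hello World: ");
   Append (UWWS, WWS_Hello_World_Arabic);

   Show_WW_String : declare
      WWS : constant Wide_Wide_String := To_Wide_Wide_String (UWWS);
   begin
      TIO.Put_Line ("Wide_Wide_String Length:"
                    & Integer'Image (WWS'Length));
      TIO.Put_Line ("Wide_Wide_String Size:  "
                    & Integer'Image (WWS'Size));
   end Show_WW_String;

   Show_UTF_8_String : declare
      --  Encode the wide-wide string back into UTF-8
      S_UTF_8 : constant UTF_8_String :=
        Encode (To_Wide_Wide_String (UWWS));
   begin
      TIO.Put_Line ("UTF_8_String Length:    "
                    & Integer'Image (S_UTF_8'Length));
      TIO.Put_Line ("UTF_8_String Size:      "
                    & Integer'Image (S_UTF_8'Size));
   end Show_UTF_8_String;
end Show_UTF_8;
```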
In this application, we start by storing a string in Arabic in the Hello_World_Arabic constant. We then use the Decode function to convert that string from the UTF_8_String type to the Wide_Wide_String type, and we store it in the WWS_Hello_World_Arabic constant.
We use a variable of type Unbounded_Wide_Wide_String (UWWS) to manipulate strings: we append the string in Arabic to the "Hello World: " string and store it in UWWS.
In the Show_WW_String block, we convert the string stored in UWWS from the Unbounded_Wide_Wide_String type to the Wide_Wide_String type and display the length and size of the string. We do something similar in the Show_UTF_8_String block, but there, we convert to the UTF_8_String type.
Also, in the Show_UTF_8_String block, we use the Encode function to convert that string from the Wide_Wide_String type to the UTF_8_String type, and we store it in the S_UTF_8 constant.
UTF-8 size and length
As you can see when running the last code example from the previous subsection, we have different sizes and lengths depending on the string type:
String type | Size | Length |
---|---|---|
Wide_Wide_String | 832 | 26 |
UTF_8_String | 296 | 37 |
The size needed for storing the string when using the Wide_Wide_String type is bigger than the one when using the UTF_8_String type. This is expected, as Wide_Wide_String uses 32-bit characters, while UTF_8_String uses 8-bit characters to store the string in a more memory-efficient way.
The length of the string using the Wide_Wide_String type is equal to the number of symbols in the original string: 26 characters. When using UTF-8, however, we may need more than one 8-bit character to represent one symbol from the original string, so we may end up with a length that is bigger than the actual number of symbols, as is the case in this source-code example.
This difference in sizes might not always be the case. In fact, the sizes match when encoding a symbol in UTF-8 that requires four 8-bit characters. For example:
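A small sketch of this case, assuming a UTF-8 encoded source file whose bytes the compiler passes through unchanged (the procedure and constant names are chosen for illustration):

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Emoji_Size is
   --  Assumption: this literal is read as a sequence of UTF-8 bytes
   Emoji_Symbol : constant UTF_8_String := "😀";

   WWS_Emoji_Symbol : constant Wide_Wide_String :=
     Decode (Emoji_Symbol);
begin
   Put_Line ("Wide_Wide_String Length:"
             & Integer'Image (WWS_Emoji_Symbol'Length));
   Put_Line ("Wide_Wide_String Size:  "
             & Integer'Image (WWS_Emoji_Symbol'Size));
   Put_Line ("UTF_8_String Length:    "
             & Integer'Image (Emoji_Symbol'Length));
   Put_Line ("UTF_8_String Size:      "
             & Integer'Image (Emoji_Symbol'Size));
end Show_Emoji_Size;
```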
In this case, both strings, whether using the Wide_Wide_String type or the UTF_8_String type, have the same size: 32 bits.
Portability of UTF-8 in source-code files
In the previous code example, we assumed that the source-code file itself is encoded in UTF-8. This allows us to use emojis, and other Unicode symbols, directly in strings:
Emoji_Symbol : constant UTF_8_String := "😀";
This approach, however, might not be portable. For example, if the compiler uses a different encoding for source-code files, it might interpret that Unicode symbol as something else, or simply report a compilation error.
If you're concerned that encoding mismatches might happen in your compilation environment, you may want to write strings in your code in a completely portable fashion, which consists of entering the exact sequence of byte codes, using the Character'Val function, for the symbols you want to use.
We can reuse parts of the previous example and replace the UTF-8 symbol with the corresponding UTF-8 code:
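A sketch of this portable form, with a procedure name chosen for illustration. The symbol "😀" is the code point U+1F600, whose UTF-8 encoding is the byte sequence F0 9F 98 80:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_Portable_Emoji is
   --  UTF-8 byte sequence for U+1F600, written without relying on
   --  the encoding of the source file
   Emoji_Symbol : constant UTF_8_String :=
     Character'Val (16#F0#) & Character'Val (16#9F#) &
     Character'Val (16#98#) & Character'Val (16#80#);

   WWS_Emoji_Symbol : constant Wide_Wide_String :=
     Decode (Emoji_Symbol);
begin
   Put_Line ("UTF_8_String Length:    "
             & Integer'Image (Emoji_Symbol'Length));
   Put_Line ("Wide_Wide_String Length:"
             & Integer'Image (WWS_Emoji_Symbol'Length));
end Show_Portable_Emoji;
```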
Here, we use a sequence of four calls to the Character'Val (code) function for the UTF-8 code that corresponds to the "😀" symbol.
UTF-16 encoding and decoding
So far, we've discussed the UTF-8 encoding scheme. However, other encoding schemes exist and are supported as well. In fact, the Ada.Strings.UTF_Encoding package defines three encoding schemes:
type Encoding_Scheme is (UTF_8, UTF_16BE, UTF_16LE);
For example, instead of using UTF-8 encoding, we can use UTF-16 encoding.
To convert between UTF-8 and UTF-16 encoding schemes, we can make use of the conversion functions from the Ada.Strings.UTF_Encoding.Conversions package.
To declare a UTF-16 encoded string, we can use one of the following data types:
- the 8-bit-character based UTF_String type, or
- the 16-bit-character based UTF_16_Wide_String type.
When using the 8-bit version, though, we have to specify the input and output schemes when converting between UTF-8 and UTF-16 encoding schemes.
Let's see a code example that makes use of both UTF_String and UTF_16_Wide_String types:
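A sketch along these lines, assuming a UTF-8 encoded source file whose bytes the compiler passes through unchanged. The procedure name, the names of the two UTF-16 constants and the choice of the UTF_16LE scheme are assumptions made for this illustration:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Conversions;
use  Ada.Strings.UTF_Encoding.Conversions;

procedure Show_UTF16_Types is
   --  Assumption: this literal is read as a sequence of UTF-8 bytes
   World_Emoji_UTF_8 : constant UTF_8_String := "🌐";

   --  16-bit code units: no scheme parameters are needed
   World_Emoji_UTF_16 : constant UTF_16_Wide_String :=
     Convert (World_Emoji_UTF_8);

   --  8-bit code units: input and output schemes must be specified
   World_Emoji_UTF_16_LE : constant UTF_String :=
     Convert (Item          => World_Emoji_UTF_8,
              Input_Scheme  => UTF_8,
              Output_Scheme => UTF_16LE);
begin
   Put_Line ("UTF_8_String Length:      "
             & Integer'Image (World_Emoji_UTF_8'Length));
   Put_Line ("UTF_16_Wide_String Length:"
             & Integer'Image (World_Emoji_UTF_16'Length));
   Put_Line ("UTF_String Length:        "
             & Integer'Image (World_Emoji_UTF_16_LE'Length));
end Show_UTF16_Types;
```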
In this example, we declare a UTF-8 encoded string and store it in the World_Emoji_UTF_8 constant. Then, we call the Convert functions to convert between the UTF-8 and UTF-16 encoding schemes. We use two versions of this function:
- the Convert function that returns an object of UTF_16_Wide_String type for an input of UTF_8_String type, and
- the Convert function that returns an object of UTF_String type for an input of UTF_8_String type. In this case, we need to specify the input and output schemes (see the Input_Scheme and Output_Scheme parameters in the code example).
Previously, we've seen that the Ada.Strings.UTF_Encoding.Wide_Wide_Strings package offers functions to convert between UTF-8 and the Wide_Wide_String type. The same kind of conversion functions exist for UTF-16 strings as well. Let's look at this code example:
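A sketch with hypothetical procedure and constant names. The symbol "🌐" is the code point U+1F310, which UTF-16 represents with the surrogate pair D83C DF10:

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding;
use  Ada.Strings.UTF_Encoding;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;
use  Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Show_UTF16_Decode is
   --  UTF-16 surrogate pair for U+1F310
   World_Emoji_UTF_16 : constant UTF_16_Wide_String :=
     Wide_Character'Val (16#D83C#) & Wide_Character'Val (16#DF10#);

   --  Decode the two 16-bit code units into a wide-wide string
   WWS_World_Emoji : constant Wide_Wide_String :=
     Decode (World_Emoji_UTF_16);
begin
   Put_Line ("UTF_16_Wide_String Length:"
             & Integer'Image (World_Emoji_UTF_16'Length));
   Put_Line ("Wide_Wide_String Length:  "
             & Integer'Image (WWS_World_Emoji'Length));
end Show_UTF16_Decode;
```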
In this example, we call the Wide_Character'Val function to specify the UTF-16 code for an emoji: the "🌐" symbol. We then use the Decode function to convert from the UTF_16_Wide_String type to the Wide_Wide_String type.