Discussion:
windows-1251 to utf-8
e***@gmail.com
2018-10-31 02:57:15 UTC
I get HTML from a web server in windows-1251 encoding.
How do I convert HTML from windows-1251 to UTF-8?
Thanks.
g***@hotmail.com
2018-10-31 06:09:02 UTC
Have a look here:

https://sf.net/p/wasabee/code/HEAD/tree/zrt_dev/common/wasabee-encoding.adb

HTH
G.
Dmitry A. Kazakov
2018-10-31 10:01:47 UTC
Post by e***@gmail.com
I get HTML from a web server in windows-1251 encoding.
How do I convert HTML from windows-1251 to UTF-8?
The encoding table is this:

https://en.wikipedia.org/wiki/Windows-1251

The 7-bit codes correspond to UTF-8 directly. For 8-bit codes (for all
codes, actually) you take the code point from the table, e.g. Cyrillic
capital Ц -> 16#0426#, and convert it to a UTF-8 sequence using, for
example, this:

http://www.dmitry-kazakov.de/ada/strings_edit.htm#7

The function Strings_Edit.UTF8.Image takes a code point and returns its
UTF-8 equivalent, so

Strings_Edit.UTF8.Image (16#0426#)

gives Ц in UTF-8.
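
For anyone who wants to check the arithmetic without installing anything, here is a minimal sketch that uses the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Strings instead of Strings_Edit (that swap is mine, not part of the library mentioned above). The procedure name Show_Tse is made up for illustration. It encodes the code point 16#0426# and prints the resulting octets in decimal; by hand, 16#0426# = 16 * 64 + 38, so the two octets should be 16#C0# + 16 = 208 and 16#80# + 38 = 166.

```ada
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Text_IO;

--  Hypothetical demo: encode U+0426 (Cyrillic capital Tse) as UTF-8
--  and print each resulting octet as a decimal number.
procedure Show_Tse is
   use Ada.Strings.UTF_Encoding;

   --  One-element Wide_String holding the code point from the table.
   Encoded : constant UTF_8_String :=
     Wide_Strings.Encode ((1 => Wide_Character'Val (16#0426#)));
begin
   for C of Encoded loop
      Ada.Text_IO.Put (Integer'Image (Character'Pos (C)));
   end loop;
   Ada.Text_IO.New_Line;
end Show_Tse;
```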

HTML is an unrelated story. Do you mean RFC 2396 escape sequences? This
is an alternative representation that has nothing to do with Windows-1251.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
e***@gmail.com
2018-10-31 15:28:32 UTC
Let's make it easier. For example:

------------------------------------------------------------------

with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;
with Ada.Text_IO.Unbounded_IO; use Ada.Text_IO.Unbounded_IO;

with AWS.Client; use AWS.Client;
with AWS.Messages; use AWS.Messages;
with AWS.Response; use AWS.Response;

procedure Main is

   HTML_Result         : Unbounded_String;
   Request_Header_List : Header_List;

begin

   Request_Header_List.Add
     (Name  => "User-Agent",
      Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");

   HTML_Result := Message_Body
     (Get (URL => "http://www.sql.ru/", Headers => Request_Header_List));

   Put_Line (HTML_Result);

end Main;

------------------------------------------------------------------

My Linux terminal (UTF-8 by default) shows: https://photos.app.goo.gl/EPgwKoiFSuwkJvgSA

If I set the terminal encoding to Windows-1251, all is well: https://photos.app.goo.gl/goN5g7uofD8rYLP79

Are there standard ways to solve this problem?
Shark8
2018-10-31 16:50:50 UTC
Post by e***@gmail.com
Are there standard ways to solve this problem?
I *think* you can use Character-mapping to translate from Windows-1251 to UTF-X... although I'm unsure if it has to be the same character-size.

Failing that, maybe Matreshka -- http://forge.ada-ru.org/matreshka -- has something for it. I haven't used Matreshka [yet] but there's supposedly a big Unicode/manipulation library in it.
Dmitry A. Kazakov
2018-10-31 17:01:21 UTC
Post by e***@gmail.com
Are there standard ways to solve this problem?
What problem? The page uses the content charset=windows-1251. It is legal.

Your program is illegal as it prints the body using Put_Line. The Ada
standard requires Character to be Latin-1. The only case in which your
program would be correct is when charset=ISO-8859-1.

You must convert the page body according to the encoding specified by
the charset key into a string containing UTF-8 octets and use
Streams.Stream_IO to write these octets as-is. The conversion for the
case of windows-1251 I described earlier. Create a table Character'Pos
0..255 -> Code_Point and use it for each "character" of HTML_Result.
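
A minimal sketch of that table approach, under several assumptions of mine: the names Print_CP1251, Code_Point, Code_Table, Build_Table and To_UTF_8 are invented for illustration, the UTF-8 encoder is hand-written rather than taken from any library, and the table only fills in the Cyrillic letters (16#C0# .. 16#FF#, plus Ё and ё). The remaining codes in 16#80# .. 16#BF# are left as identity placeholders and would have to be completed from the Wikipedia table. To reach standard output it uses Ada.Text_IO.Text_Streams rather than Streams.Stream_IO, since standard output is not a named file.

```ada
with Ada.Text_IO;
with Ada.Text_IO.Text_Streams;

procedure Print_CP1251 is

   --  Windows-1251 only maps into the Basic Multilingual Plane.
   type Code_Point is range 0 .. 16#FFFF#;
   type Code_Table is array (Character) of Code_Point;

   --  Builds the Character'Pos -> code point table (partial, see above).
   function Build_Table return Code_Table is
      Table : Code_Table;
   begin
      for C in Character loop
         Table (C) := Character'Pos (C);  --  placeholder; exact for 0 .. 127
      end loop;
      for Offset in 0 .. 63 loop          --  16#C0# .. 16#FF# -> U+0410 .. U+044F
         Table (Character'Val (16#C0# + Offset)) :=
           Code_Point (16#0410# + Offset);
      end loop;
      Table (Character'Val (16#A8#)) := 16#0401#;  --  capital Io
      Table (Character'Val (16#B8#)) := 16#0451#;  --  small io
      return Table;
   end Build_Table;

   To_Unicode : constant Code_Table := Build_Table;

   --  Encodes one code point as 1, 2 or 3 UTF-8 octets.
   function To_UTF_8 (Value : Code_Point) return String is
   begin
      if Value < 16#80# then
         return (1 => Character'Val (Value));
      elsif Value < 16#800# then
         return (Character'Val (16#C0# + Value / 64),
                 Character'Val (16#80# + Value mod 64));
      else
         return (Character'Val (16#E0# + Value / 4096),
                 Character'Val (16#80# + (Value / 64) mod 64),
                 Character'Val (16#80# + Value mod 64));
      end if;
   end To_UTF_8;

   --  Stand-in for the page body: the Windows-1251 octet for Ц, then "a".
   Input  : constant String := (Character'Val (16#D6#), 'a');
   Output : constant Ada.Text_IO.Text_Streams.Stream_Access :=
     Ada.Text_IO.Text_Streams.Stream (Ada.Text_IO.Standard_Output);
begin
   for C of Input loop
      --  String'Write emits the octets as-is, without bounds.
      String'Write (Output, To_UTF_8 (To_Unicode (C)));
   end loop;
end Print_CP1251;
```

In the real program Input would be To_String (HTML_Result) instead of the two-character stand-in.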

P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
underlying OS.

P.P.S. Technically AWS also ignores the Ada standard, but that is
established practice, since there is no better way.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
Randy Brukardt
2018-10-31 20:58:21 UTC
Post by Dmitry A. Kazakov
What problem? The page uses the content charset=windows-1251. It is legal.
Your program is illegal as it prints the body using Put_Line. Ada standard
requires Character be Latin-1. The only case when your program would be
correct is when charset=ISO-8859-1.
You must convert the page body according to the encoding specified by the
charset key into a string containing UTF-8 octets and use
Streams.Stream_IO to write these octets as-is. The conversion for the case
of windows-1251 I described earlier. Create a table Character'Pos
0..255 -> Code_Point and use it for each "character" of HTML_Result.
P.S. GNAT Text_IO ignores Latin-1, but that is between GNAT and the
underlying OS.
P.P.S. Technically AWS also ignores Ada standard. But that is an
established practice. Since there is no better way.
Right. Probably the easiest way to do this (using just Ada functions) would
be to:

(A) Use Ada.Characters.Conversions.To_Wide_String to convert the To_String
of the unbounded string to a Wide_String, and then store that in an
Unbounded_Wide_String (from Ada.Strings.Wide_Unbounded);
(B) Use Ada.Strings.Wide_Maps to create a character conversion map (the
conversions were described by another reply);
(C) Use Ada.Strings.Wide_Unbounded.Translate to apply the mapping from (B)
to your Unbounded_Wide_String;
(D) Use Ada.Strings.UTF_Encoding.Wide_Strings.Encode to convert the
To_Wide_String of your translated Unbounded_Wide_String to UTF-8, presumably
storing the result into an Unbounded_String.
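
Steps (A) to (D) can be sketched roughly as follows. This is only an illustration with assumptions of mine: the function name CP1251_To_UTF_8 is invented, it works on a plain String and so uses Ada.Strings.Wide_Fixed.Translate instead of the Wide_Unbounded variant, and the map covers only the Cyrillic letters (16#C0# .. 16#FF#, plus Ё and ё). The other codes in 16#80# .. 16#BF# pass through unchanged, which is not correct for all of them, so the map would need to be completed from the Windows-1251 table.

```ada
with Ada.Characters.Conversions;
with Ada.Strings.UTF_Encoding.Wide_Strings;
with Ada.Strings.Wide_Fixed;
with Ada.Strings.Wide_Maps;

function CP1251_To_UTF_8
  (Source : String) return Ada.Strings.UTF_Encoding.UTF_8_String
is
   use Ada.Strings.Wide_Maps;

   --  64 letters from 16#C0# .. 16#FF# plus capital Io and small io.
   From, To : Wide_String (1 .. 66);
begin
   --  (B) Build the domain and range of the conversion map.
   for I in 0 .. 63 loop
      From (I + 1) := Wide_Character'Val (16#C0# + I);
      To (I + 1)   := Wide_Character'Val (16#0410# + I);
   end loop;
   From (65) := Wide_Character'Val (16#A8#);
   To (65)   := Wide_Character'Val (16#0401#);
   From (66) := Wide_Character'Val (16#B8#);
   To (66)   := Wide_Character'Val (16#0451#);

   declare
      Map  : constant Wide_Character_Mapping := To_Mapping (From, To);

      --  (A) Widen each octet to a Wide_Character with the same position.
      Wide : constant Wide_String :=
        Ada.Characters.Conversions.To_Wide_String (Source);
   begin
      --  (C) Translate through the map, then (D) encode as UTF-8.
      return Ada.Strings.UTF_Encoding.Wide_Strings.Encode
        (Ada.Strings.Wide_Fixed.Translate (Wide, Map));
   end;
end CP1251_To_UTF_8;
```

With the program from the original question this would be called as CP1251_To_UTF_8 (To_String (HTML_Result)), and the result still has to be written as raw octets.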

You potentially could skip (D) if Wide_Text_IO works when sent to
Standard_Output (I'd expect that on Windows, no idea on Linux). In that
case, use Wide_Text_IO.Put to send your result.

In any case, this shows why Unicode exists, and why anything these days that
uses non-standard encodings is evil. There's really no short-cut to recoding
such things, and that makes them maddening.

Randy.
Björn Lundin
2018-11-01 12:49:00 UTC
Post by e***@gmail.com
Are there standard ways to solve this problem?
In xml/ada there are unicode packages.

something like (with changes for 1251 instead of Latin_1 to be done)

with Unicode.Ces.Utf8, Unicode.Ces.Utf32, Unicode.Ces.Basic_8bit,
Unicode.Ccs.ISO_8859_1;
use Unicode, Unicode.Ccs, Unicode.Ces, Unicode.Ces.Utf8, Unicode.Ces.Utf32;

-- some of the withs are likely not needed; the code is copied from a bigger function


function To_Utf_8_From_Latin_1_Little_Endian
  (A_Latin_1_Encoded_String : in String)
   return String is

   -- 32-bit Latin-1 string (normal Ada string with 32-bit characters)
   S_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
     Unicode.Ces.Basic_8bit.To_Utf32 (A_Latin_1_Encoded_String);

   -- UTF-32 string (convert Latin-1 to Unicode characters)
   U_32 : Unicode.Ces.Utf32.Utf32_Le_String :=
     Unicode.Ces.Utf32.To_Unicode_Le
       (S_32,
        Cs => Unicode.Ccs.ISO_8859_1.ISO_8859_1_Character_Set);

   -- change UTF-32 to UTF-8
   An_Utf_8_Encoded_String_Le : Unicode.Ces.Utf8.Utf8_String :=
     Unicode.Ces.Utf8.From_Utf32 (U_32);

begin
   return An_Utf_8_Encoded_String_Le;
end To_Utf_8_From_Latin_1_Little_Endian;

---------------------------------------------------------------------------------


It's a starting point
--
--
Björn
Dmitry A. Kazakov
2018-11-01 13:26:39 UTC
Post by Björn Lundin
something like (with changes for 1251 instead of Latin_1 to be done)
You probably mean 1252, which is almost Latin-1. 1251 is totally different:
it has Cyrillic letters in the upper half of the 8-bit codes, in the place
where 1252 keeps Western European letters with diacritical marks.

Maybe I will add 1251 and 1252 in the next release of the Strings Edit
library.
--
Regards,
Dmitry A. Kazakov
http://www.dmitry-kazakov.de
Björn Lundin
2018-11-01 14:34:28 UTC
Post by Dmitry A. Kazakov
Post by Björn Lundin
something like (with changes for 1251 instead of Latin_1 to be done)
You probably mean 1252, which is almost Latin-1.
I do.
Post by Dmitry A. Kazakov
1251 is totally different:
it has Cyrillic letters in the upper half of the 8-bit codes, in the place
where 1252 keeps Western European letters with diacritical marks.
And I also found that the code in last post can be replaced by

-------------------------------------------------------
function To_Iso_Latin_15(Str : Unicode.CES.Byte_Sequence) return String is
use Unicode.Encodings;
begin
return Convert(Str => Str,
From => Get_By_Name("utf-8"),
To => Get_By_Name("iso-8859-15"));

end To_Iso_Latin_15;
-------------------------------------------------------

I also see that the unicode package in xml/ada has support for
1251 and 1252.

package Unicode.CCS.Windows_1251 is ...

the withs are
with Ada.Exceptions; use Ada.Exceptions;
with Unicode.Names.Cyrillic; use Unicode.Names.Cyrillic;
with Unicode.Names.Basic_Latin; use Unicode.Names.Basic_Latin;
with Unicode.Names.Latin_1_Supplement; use Unicode.Names.Latin_1_Supplement;
with Unicode.Names.Currency_Symbols; use Unicode.Names.Currency_Symbols;
with Unicode.Names.General_Punctuation;
use Unicode.Names.General_Punctuation;
with Unicode.Names.Letterlike_Symbols;
use Unicode.Names.Letterlike_Symbols;



which suggests to me that it is the cyrillic one


which (I think) would make the function above


-------------------------------------------------------
function To_Windows_1251(Str : Unicode.CES.Byte_Sequence) return String is
use Unicode.Encodings;
begin
return Convert(Str => Str,
From => Get_By_Name("utf-8"),
To => Get_By_Name("Windows-1251"));

end To_Windows_1251;
-------------------------------------------------------
--
--
Björn
Vadim Godunko
2018-11-01 18:14:34 UTC
You can use Matreshka's text codecs. Here is an example.

with Ada.Text_IO; use Ada.Text_IO;

with AWS.Client; use AWS.Client;
with AWS.Response; use AWS.Response;

with League.Strings; use League.Strings;
with League.Text_Codecs; use League.Text_Codecs;

procedure Main is
   Request_Header_List : Header_List;
   CP1251_Codec        : Text_Codec := Codec (To_Universal_String ("cp1251"));
   Text                : Universal_String;

begin

   Request_Header_List.Add
     (Name  => "User-Agent",
      Value => "Mozilla/5.0 (X11; Linux x86_64; rv:56.0) Gecko/20100101 Firefox/56.0");

   Text := CP1251_Codec.Decode
     (Message_Body
        (Get (URL => "http://www.sql.ru/", Headers => Request_Header_List)));

   Put_Line (Text.To_UTF_8_String);

end Main;
