Discussion:
pdftohtml encoding question
François Patte
2008-03-10 22:27:13 UTC
Good evening,

I am trying to convert a PDF file into HTML using pdftohtml as provided by F8.

I get an HTML file with "nice" characters like â€™ instead of an apostrophe,
or Ã© instead of é...

so I think there is some encoding problem.
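
The symptom described above (several odd characters appearing in place of one apostrophe or accented letter) is what typically happens when UTF-8 bytes are decoded with a single-byte codec. A minimal sketch of that effect follows; the helper name `misread_utf8` and the choice of `cp1252` as the wrong codec are assumptions made for illustration, not something taken from pdftohtml itself.

```python
# Illustrative only: reproduce the symptom by decoding UTF-8 bytes
# with the wrong single-byte codec.

def misread_utf8(text: str, wrong_codec: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with another codec.

    `wrong_codec` defaults to cp1252 as an assumption; "latin-1" behaves
    the same way for most accented Western European letters.
    """
    return text.encode("utf-8").decode(wrong_codec)


if __name__ == "__main__":
    # One accented letter turns into two characters.
    print(misread_utf8("\u00e9"))
    # One typographic apostrophe turns into three characters.
    print(misread_utf8("\u2019"))
```

If the output of this sketch matches what the browser shows, the file itself is probably valid UTF-8 and only the declared or assumed encoding is wrong.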

Using man pdftohtml, I got this info:

  -enc <string>
      output text encoding name

but I cannot work out what syntax to use to get correct output in UTF-8, because

Error: Couldn't find unicodeMap file for the 'utf8' encoding

is the only answer I get if I try:

pdftohtml -enc utf8 myfile.pdf

I also tried utf-8, latin1, latin-1, ISO_8859-1, ... without any success.

If somebody knows... many thanks in advance.


--
François Patte
UFR de mathématiques et informatique
Université Paris Descartes
45, rue des Saints Pères
F-75270 Paris Cedex 06
Tél. +33 (0)1 44 55 35 61
http://www.math-info.univ-paris5.fr/~patte
Andras Simon
2008-03-11 12:40:21 UTC
> Good evening,
> I am trying to convert a PDF file into HTML using pdftohtml as provided by F8.
> I get an HTML file with "nice" characters like â€™ instead of an apostrophe,
> or Ã© instead of é...
> so I think there is some encoding problem.
>   -enc <string>
>       output text encoding name
> but I cannot work out what syntax to use to get correct output in UTF-8:
> Error: Couldn't find unicodeMap file for the 'utf8' encoding
> pdftohtml -enc utf8 myfile.pdf
> I also tried utf-8, latin1, latin-1, ISO_8859-1, ... without any success.
> If somebody knows... many thanks in advance.

I don't, but

man pdftohtml

-> Pdftohtml was developed by Gueorgui Ovtcharov and Rainer Dorsch. It is
based and benefits a lot from Derek Noonburg's xpdf package.

man xpdf

-> -enc encoding-name
   Sets the encoding to use for text output. The encoding-name
   must be defined with the unicodeMap command (see xpdfrc(5)).
   This defaults to "Latin1" (which is a built-in encoding).
   [config file: textEncoding]

man xpdfrc

-> unicodeMap encoding-name map-file
[...]
The Latin1, ASCII7, Symbol, ZapfDingbats, UTF-8, and
UCS-2 encodings are predefined.

I'm afraid you'll have to read the elided part if you need an encoding
other than these six.
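
To make the point about the six predefined names concrete, here is a small sketch that refuses any encoding name not spelled as in the man page excerpt above before building a pdftohtml command line. The helper `build_pdftohtml_command` is hypothetical, and the idea that the name has to match the man page spelling exactly (for example "UTF-8" rather than "utf8") is an assumption based on the error message reported earlier in the thread, not something I have checked in the source.

```python
# Names listed as predefined in the xpdfrc man page excerpt quoted above.
PREDEFINED_ENCODINGS = ("Latin1", "ASCII7", "Symbol", "ZapfDingbats", "UTF-8", "UCS-2")


def build_pdftohtml_command(pdf_path: str, encoding: str = "UTF-8") -> list:
    """Return an argument list for pdftohtml (hypothetical helper).

    Assumption: the encoding name must be spelled exactly as in the man
    page, so anything else is rejected here instead of being passed on.
    """
    if encoding not in PREDEFINED_ENCODINGS:
        raise ValueError(
            f"{encoding!r} is not one of the predefined names: "
            + ", ".join(PREDEFINED_ENCODINGS)
        )
    # Only the -enc option shown in the man page excerpt is used.
    return ["pdftohtml", "-enc", encoding, pdf_path]


if __name__ == "__main__":
    # Prints the command instead of running it; the list could be handed
    # to subprocess.run(..., check=True) on a machine with pdftohtml installed.
    print(build_pdftohtml_command("myfile.pdf"))
```

Building the command as a list rather than a string avoids shell quoting problems with file names that contain spaces.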

Hope this helps,

Andras
François Patte
2008-03-12 07:47:45 UTC
On 11.03.2008 13:40, Andras Simon wrote:
| On 3/10/08, François Patte <francois.patte at math-info.univ-paris5.fr> wrote:
|>
|> I am trying to convert a PDF file into HTML using pdftohtml as provided by F8.
|>
|> I get an HTML file with "nice" characters like â€™ instead of an apostrophe,
|> or Ã© instead of é...
|
| I don't, but
|
| man pdftohtml
<snip>

Thanks for answering. The problem was solved when I looked at the HTML
file produced: this line was missing

  <meta http-equiv="Content-Type" content="text/html; charset=utf-8">

even though the PDF file was produced from LaTeX with UTF-8 encoding.
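
For anyone who wants to apply the same fix to many generated files, a minimal sketch follows. It adds the missing line right after the opening head tag when no charset is declared yet. The function name `ensure_charset_meta` is made up for this example, and the regular expressions are a rough approximation, not a real HTML parser.

```python
import re

# The line that was missing from the generated file.
META_TAG = '<meta http-equiv="Content-Type" content="text/html; charset=utf-8">'


def ensure_charset_meta(html: str) -> str:
    """Insert META_TAG after the opening <head> tag if no charset is declared.

    Hypothetical helper: it leaves the text unchanged when a meta charset
    already exists or when there is no <head> tag to attach the line to.
    """
    if re.search(r"<meta[^>]*charset\s*=", html, flags=re.IGNORECASE):
        return html  # a charset is already declared
    head = re.search(r"<head(\s[^>]*)?>", html, flags=re.IGNORECASE)
    if head is None:
        return html  # nothing to anchor on; handle by hand
    return html[: head.end()] + "\n" + META_TAG + html[head.end():]


if __name__ == "__main__":
    page = "<html><head><title>t</title></head><body>caf\u00e9</body></html>"
    print(ensure_charset_meta(page))
```

When writing the result back to disk, the file should be saved as UTF-8 so that the declaration matches the actual bytes.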

One mystery remains: why is the default encoding for the browser (Firefox),
or for OpenOffice, Latin-1?

Best regards
--
François Patte
UFR de mathématiques et informatique
Université Paris Descartes
45, rue des Saints Pères
F-75270 Paris Cedex 06
Tél. +33 (0)1 44 55 35 61
http://www.math-info.univ-paris5.fr/~patte
Tim
2008-03-12 11:07:57 UTC
> The problem was solved when I looked at the HTML file produced: this
> line was missing
>   <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
> even though the PDF file was produced from LaTeX with UTF-8 encoding.
> One mystery remains: why is the default encoding for the browser (Firefox),
> or for OpenOffice, Latin-1?

According to the HTTP specifications, the default encoding for web browsing
and serving is iso-8859-1; anything different needs to be stated explicitly.
The meta statement is one way to do that, and about the only choice you have
if you open a file directly rather than having it served by a web server.
If the file is served, the HTTP headers about content type overrule anything
typed into the file itself (the meta statement is to be ignored). If you set
your browser's default to something other than iso-8859-1, you'll have
problems rendering pages that are served correctly (i.e. pages written in
iso-8859-1 without a specific content-type description saying so), and
that's an awful lot of web pages.
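
The order of precedence described above can be sketched roughly as follows: a charset in the HTTP Content-Type header wins, then a meta statement in the file, then the iso-8859-1 default. The function names are invented for this example, the regular expressions only approximate what a browser does, and falling back to the meta statement when the header carries no charset is an assumption of this sketch.

```python
import re
from typing import Optional

DEFAULT_CHARSET = "iso-8859-1"  # the HTTP default described above

_CHARSET = r"charset\s*=\s*[\"']?([A-Za-z0-9._:-]+)"


def charset_from_header(content_type: Optional[str]) -> Optional[str]:
    """Extract the charset parameter from a Content-Type header value, if any."""
    if not content_type:
        return None
    match = re.search(_CHARSET, content_type, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def charset_from_meta(html_bytes: bytes) -> Optional[str]:
    """Look for a meta charset declaration near the start of the file."""
    # latin-1 maps every byte to a character, so this scan cannot fail.
    start = html_bytes[:1024].decode("latin-1")
    match = re.search(r"<meta[^>]*?" + _CHARSET, start, flags=re.IGNORECASE)
    return match.group(1).lower() if match else None


def effective_charset(content_type: Optional[str], html_bytes: bytes) -> str:
    """Header first, then meta (assumption when the header has no charset), then default."""
    return (
        charset_from_header(content_type)
        or charset_from_meta(html_bytes)
        or DEFAULT_CHARSET
    )


if __name__ == "__main__":
    page = b'<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"></head></html>'
    print(effective_charset(None, page))  # file opened directly
    print(effective_charset("text/html; charset=ISO-8859-1", page))  # served
```

Passing `None` as the header models opening the file directly from disk, which is the case where the meta statement is the only declaration available.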
--
(This computer runs FC7, my others run FC4, FC5 & FC6, in case that's
important to the thread.)

Don't send private replies to my address, the mailbox is ignored.
I read messages from the public lists.