Handle full-text content in other languages #249

@dawnyesky

Retrieving a non-English webpage returns badly encoded results: the text comes back as strings of hex digits (e.g. "&#x4E2D;&#x65B0;&#x7F51;" for "中新网") instead of readable text. Is there a way to fix this? I tried the CLI version of Mercury Parser with the parameter --format markdown, which produced the correct text, but I have no idea how to pass this kind of parameter when calling mercury-parser-api. Please try the example URLs below to reproduce the problem:

  1. https://news.sina.com.cn/c/2021-01-23/doc-ikftssan9988691.shtml
  2. http://www.chinanews.com/sh/2021/01-24/9395190.shtml
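
For anyone hitting the same thing, a possible client-side workaround is to decode the character references after receiving the response. This is only a sketch, not a feature of mercury-parser-api: it assumes the hex strings are HTML numeric character references such as `&#x4E2D;`, and the `content` key and the sample payload below are made up for illustration.

```python
import html


def decode_entities(text: str) -> str:
    """Turn HTML character references (e.g. &#x4E2D;) back into Unicode text."""
    return html.unescape(text)


# Hypothetical response fragment; the real API payload may look different.
sample = {"content": "<p>&#x4E2D;&#x65B0;&#x7F51;</p>"}

decoded = decode_entities(sample["content"])
print(decoded)
```

`html.unescape` is part of the Python standard library and leaves text without character references unchanged, so it should be safe to apply to English pages as well.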
