Trouble cleaning up HTML - *nix

shell

PeterVandivier

I'm trying to extract a table element from the online docs page at https://www.openssl.org/docs/man1.1.1/man1/ [^1]. 

I've extracted the XPath to the element from Chrome's developer tools: `//*[@id="content"]/div/article/div/table`

![Screenshot 2020-09-08 at 09.44.23.png](/image?hash=6e903bfd979e9e931938ee086192d6d135948ecfd6e432ca4364440577a59e82)

And I started by dumping the file to disk.

```bash
curl -s https://www.openssl.org/docs/man1.1.1/man1/ > index.html
```

Sadly, there's a bit of JavaScript and a few HTML entities that aren't playing nice with [xpath](https://www.unix.com/man-page/osx/1p/xpath/):

> not well-formed (invalid token) at line 19, column 26, byte 702:
> 
> undefined entity at line 184, column 8, byte 12584:
> 
> undefined entity at line 217, column 14, byte 13369:

I'm lazy and `sed` is a thing, so I stripped these out.

```bash
curl -s https://www.openssl.org/docs/man1.1.1/man1/ | \
    sed 's/&&//' | \
    sed -E 's/(&[a-z]{4});//' \
  > index.html
```

Sadly, however, I'm still getting the following error when running `xpath index.html '//*[@id="content"]/div/article/div/table'`:

> mismatched tag at line 223, column 2, byte 13451:
> 
> </body>
> </html>
> =^
>  at /System/Library/Perl/Extras/5.18/darwin-thread-multi-2level/XML/Parser.pm line 187.

This makes it look like the top-level `<html>` element is malformed. For the life of me I can't figure out where the error is.

For what it's worth, I get the same error when removing the troublesome elements manually instead of using `sed`. 

Is there something I can do to extract this element properly instead of resorting to [string parsing my html](https://stackoverflow.com/a/1732454/4709762)?

---

[^1]: Never mind why, really. This is _very much_ several levels down on an X-Y problem where I finally have _a single_ atomic issue I can ask about.

Top Answer

wizzwizz4

Your page isn't valid XHTML. I'll just snip out the middle to make it easier to see:

```html
<!DOCTYPE html>
<html lang="en">
<!-- head.inc -->
  <title>
  /docs/man1.1.1/man1/index.html
  </title>
  <meta charset="utf-8">

  *snip*

  <link href="//fonts.googleapis.com/css?family=PT+Serif:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">
  <link href="//fonts.googleapis.com/css?family=PT+Sans:regular,italic,bold,bolditalic" rel="stylesheet" type="text/css">

  <!--[if lt IE 9]>
    <script src="https://html5shiv.googlecode.com/svn/trunk/html5.js"></script>
  <![endif]-->
<!-- end -->
  *snip*
<body>
  *snip*
</body>
</html>
```

Notice anything? Here's another hint: the XML parser was looking for a `</link>`.

Yes: [SGML](https://en.wikipedia.org/wiki/Standard_Generalized_Markup_Language)-style empty elements such as `<meta>` and `<link>` need to end with a `/>`, not a `>`, to be valid XML. You're lucky that the (stuff that should be in) `<head>` is the only issue; if you strip everything from `<!-- head.inc -->` to the first `<!-- end -->` before running it through the XML parser, it should work.
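
That stripping step could look something like the following. It is only a sketch: it assumes the `<!-- head.inc -->` and `<!-- end -->` marker comments each sit on their own line, as they do in the excerpt above, and `strip_head` is a made-up helper name, not anything provided by `xpath` or `sed`.

```bash
#!/usr/bin/env bash

# strip_head (hypothetical helper): reads HTML on stdin and drops every line
# from one containing "<!-- head.inc -->" through the next line containing
# "<!-- end -->", inclusive. Everything else is passed through to stdout.
# Note: if no "<!-- end -->" follows, sed deletes through the end of input.
strip_head() {
    sed '/<!-- head.inc -->/,/<!-- end -->/d'
}

# Possible usage, chained after the clean-up already shown in the question:
#   curl -s https://www.openssl.org/docs/man1.1.1/man1/ \
#     | sed 's/&&//' \
#     | sed -E 's/(&[a-z]{4});//' \
#     | strip_head > index.html
#   xpath index.html '//*[@id="content"]/div/article/div/table'
```

A line-based range delete is enough here because the whole unclosed `<meta>`/`<link>` run lives between the two comments; it would not help if the `<body>` also contained unclosed tags.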
