MyNixOS website logo
Description

Command line tool to extract DSV data from HTML and XML with XPATH expressions.

Please see README.md

xpathdsv

Extract DSV text from HTML and XML using XPATH expressions.

Available on Hackage.

Example

If you have an HTML file like this:

sample.html

<html>
  <head><title>Test</title></head>
  <body>
    <h1>Some links</h1>
    <ul>
      <li><a href="http://news.ycombinator.com">Hacker News</a></li>
      <li><a href="http://yahoo.com">Yahoo</a>
      <li><a href="http://duckduckgo.com">Duck Duck Go</a>
      <li><a href="http://github.com">GitHub</a>
    </ul>
  </body>
</html>

You can extract a list of tab-separated values like this:

xpathdsv  '//a'  '/a/text()' '/a/@href' < sample.html

Output:

Hacker News	http://news.ycombinator.com
Yahoo	http://yahoo.com
Duck Duck Go	http://duckduckgo.com
GitHub	http://github.com

The first XPATH expression in the command sets the base node on which all the following XPATH expressions are applied. Each of the following XPATH expressions then generate a column of the row of data.

If you don't specify a text() node at the end of an XPATH expression, you'll get a string representation of a node if the node is not an attribute, which may be useful for debugging:

 xpathdsv '//a' '/a' < sample.html

Output:

<a href="http://news.ycombinator.com">Hacker News</a>
<a href="http://yahoo.com">Yahoo</a>
<a href="http://duckduckgo.com">Duck Duck Go</a>
<a href="http://github.com">GitHub</a>

Usage

xpathdsv

Usage: xpathdsv [--xml] [-F OUTPUT-DELIM] [-n NULL-OUTPUT] BASE-XPATH
                [CHILD-XPATH]
  Extract DSV data from HTML or XML with XPath

Available options:
  -h,--help                Show this help text
  --xml                    Parse as XML, rather than HTML.
  -F OUTPUT-DELIM          Default \t
  -n NULL-OUTPUT           Null value output string. Default ""

See https://github.com/danchoi/xpathdsv for more information.

Author

Daniel Choi https://github.com/danchoi

References

Metadata

Version

0.1.1.0

Executables (1)

  • bin/xpathdsv

Platforms (77)

    Darwin
    FreeBSD
    Genode
    GHCJS
    Linux
    MMIXware
    NetBSD
    none
    OpenBSD
    Redox
    Solaris
    WASI
    Windows
Show all
  • aarch64-darwin
  • aarch64-freebsd
  • aarch64-genode
  • aarch64-linux
  • aarch64-netbsd
  • aarch64-none
  • aarch64-windows
  • aarch64_be-none
  • arm-none
  • armv5tel-linux
  • armv6l-linux
  • armv6l-netbsd
  • armv6l-none
  • armv7a-darwin
  • armv7a-linux
  • armv7a-netbsd
  • armv7l-linux
  • armv7l-netbsd
  • avr-none
  • i686-cygwin
  • i686-darwin
  • i686-freebsd
  • i686-genode
  • i686-linux
  • i686-netbsd
  • i686-none
  • i686-openbsd
  • i686-windows
  • javascript-ghcjs
  • loongarch64-linux
  • m68k-linux
  • m68k-netbsd
  • m68k-none
  • microblaze-linux
  • microblaze-none
  • microblazeel-linux
  • microblazeel-none
  • mips-linux
  • mips-none
  • mips64-linux
  • mips64-none
  • mips64el-linux
  • mipsel-linux
  • mipsel-netbsd
  • mmix-mmixware
  • msp430-none
  • or1k-none
  • powerpc-netbsd
  • powerpc-none
  • powerpc64-linux
  • powerpc64le-linux
  • powerpcle-none
  • riscv32-linux
  • riscv32-netbsd
  • riscv32-none
  • riscv64-linux
  • riscv64-netbsd
  • riscv64-none
  • rx-none
  • s390-linux
  • s390-none
  • s390x-linux
  • s390x-none
  • vc4-none
  • wasm32-wasi
  • wasm64-wasi
  • x86_64-cygwin
  • x86_64-darwin
  • x86_64-freebsd
  • x86_64-genode
  • x86_64-linux
  • x86_64-netbsd
  • x86_64-none
  • x86_64-openbsd
  • x86_64-redox
  • x86_64-solaris
  • x86_64-windows