HISTORY.rst

History
=======
1.0.1 (2019-02-07)
------------------
- Accept both .yaml and .yml as valid YAML file extensions.
- Documentation fixes.
1.0 (2018-05-25)
----------------
- Bumped version to 1.0.
1.0b7 (2018-03-21)
------------------
- Dropped support for Python 3.3.
- Fixes for handling Unicode data in HTML for Python 2.
- Added registry for preprocessors.
1.0b6 (2018-01-17)
------------------
- Support for writing specifications in YAML.
1.0b5 (2018-01-16)
------------------
- Added a class-based API for writing specifications.
- Added predefined transformation functions.
- Removed callables from specification maps. Use the new API instead.
- Added support for registering new reducers and transformers.
- Added support for defining sections in document.
- Refactored XPath evaluation method in order to parse path expressions once.
- Preprocessing will be done only once when the tree is built.
- Concatenation is now the default reducing operation.
1.0b4 (2018-01-02)
------------------
- Added "--version" option to command line arguments.
- Added option to force the use of lxml's HTML builder.
- Fixed the error where non-truthy values would be excluded from the result.
- Added support for transforming node text during preprocess.
- Added separate preprocessing function to API.
- Renamed the "join" reducer as "concat".
- Renamed the "foreach" keyword for keys as "section".
- Removed some low level debug messages to substantially increase speed.
1.0b3 (2017-07-25)
------------------
- Removed the caching feature.
1.0b2 (2017-06-16)
------------------
- Added helper function for getting cache hash keys of URLs.
1.0b1 (2017-04-26)
------------------
- Added optional value transformations.
- Added support for custom reducer callables.
- Added command-line option for scraping documents from local files.
1.0a2 (2017-04-04)
------------------
- Added support for Python 2.7.
- Fixed lxml support.
1.0a1 (2016-08-24)
------------------
- First release on PyPI.
docs/Makefile

# Minimal makefile for Sphinx documentation
#
# You can set these variables from the command line.
SPHINXOPTS =
SPHINXBUILD = sphinx-build
SPHINXPROJ = piculet
SOURCEDIR = source
BUILDDIR = build
# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
.PHONY: help Makefile
# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
docs/source/_static/custom.css

docs/source/api.rst

API
===

.. automodule:: piculet
   :members:
   :show-inheritance:
docs/source/conf.py

import sys
import os
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
sys.path.insert(0, os.path.abspath('..'))
# -- General configuration ------------------------------------------------
# If your documentation needs a minimal Sphinx version, state it here.
# needs_sphinx = '1.0'
# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = [
    'sphinx.ext.autodoc',
    'pygenstub',
]
# Add any paths that contain templates here, relative to this directory.
templates_path = ['_templates']
# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'
# The encoding of source files.
# source_encoding = 'utf-8-sig'
# The master toctree document.
master_doc = 'index'
# General information about the project.
project = 'Piculet'
copyright = '2016-2018, H. Turgut Uyar'
author = 'H. Turgut Uyar'
# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = '1.0'
# The full version, including alpha/beta/rc tags.
release = '1.0'
# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None
# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
# today = ''
# Else, today_fmt is used as the format for a strftime call.
# today_fmt = '%B %d, %Y'
# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']
# The reST default role (used for this markup: `text`) to use for all
# documents.
# default_role = None
# If true, '()' will be appended to :func: etc. cross-reference text.
# add_function_parentheses = True
# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
# add_module_names = True
# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
# show_authors = False
# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'
# A list of ignored prefixes for module index sorting.
modindex_common_prefix = ['piculet.']
# If true, keep warnings as "system message" paragraphs in the built documents.
# keep_warnings = False
# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False
# -- Options for HTML output ----------------------------------------------
# The theme to use for HTML and HTML Help pages. See the documentation for
# a list of builtin themes.
html_theme = 'sphinx_rtd_theme'
# html_style = 'custom.css'
# Theme options are theme-specific and customize the look and feel of a theme
# further. For a list of options available for each theme, see the
# documentation.
# html_theme_options = {}
# Add any paths that contain custom themes here, relative to this directory.
# html_theme_path = []
# The name for this set of Sphinx documents. If None, it defaults to
# "<project> v<release> documentation".
# html_title = None
# A shorter title for the navigation bar. Default is the same as html_title.
# html_short_title = None
# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
# html_logo = None
# The name of an image file (within the static path) to use as favicon of the
# docs. This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
# html_favicon = None
# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
# html_extra_path = []
# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
# html_last_updated_fmt = '%b %d, %Y'
# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# html_use_smartypants = True
# Custom sidebar templates, maps document names to template names.
# html_sidebars = {}
# Additional templates that should be rendered to pages, maps page names to
# template names.
# html_additional_pages = {}
# If false, no module index is generated.
# html_domain_indices = True
# If false, no index is generated.
# html_use_index = True
# If true, the index is split into individual pages for each letter.
# html_split_index = False
# If true, links to the reST sources are added to the pages.
# html_show_sourcelink = True
# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
# html_show_sphinx = True
# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
# html_show_copyright = True
# If true, an OpenSearch description file will be output, and all pages will
# contain a tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
# html_use_opensearch = ''
# This is the file name suffix for HTML files (e.g. ".xhtml").
# html_file_suffix = None
# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
# 'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
# 'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
# html_search_language = 'en'
# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
# html_search_options = {'type': 'default'}
# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
# html_search_scorer = 'scorer.js'
# Output file base name for HTML help builder.
htmlhelp_basename = 'piculetdoc'
# -- Options for LaTeX output ---------------------------------------------
latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    'papersize': 'a4paper',
    # The font size ('10pt', '11pt' or '12pt').
    # 'pointsize': '10pt',
    # Additional stuff for the LaTeX preamble.
    # 'preamble': '',
    # Latex figure (float) alignment
    # 'figure_align': 'htbp',
}
# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
# author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'piculet.tex', 'Piculet Documentation',
     'H. Turgut Uyar', 'manual'),
]
# The name of an image file (relative to this directory) to place at the top of
# the title page.
# latex_logo = None
# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
# latex_use_parts = False
# If true, show page references after internal links.
# latex_show_pagerefs = False
# If true, show URL addresses after external links.
# latex_show_urls = False
# Documents to append as an appendix to all manuals.
# latex_appendices = []
# If false, no module index is generated.
# latex_domain_indices = True
# -- Options for manual page output ---------------------------------------
# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'piculet', 'Piculet Documentation',
     [author], 1)
]
# If true, show URL addresses after external links.
# man_show_urls = False
# -- Options for Texinfo output -------------------------------------------
# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
# dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'Piculet', 'Piculet Documentation',
     author, 'Piculet', 'XML/HTML scraper using XPath queries.',
     'Miscellaneous'),
]
# Documents to append as an appendix to all manuals.
# texinfo_appendices = []
# If false, no module index is generated.
# texinfo_domain_indices = True
# How to display URL addresses: 'footnote', 'no', or 'inline'.
# texinfo_show_urls = 'footnote'
# If true, do not generate a @detailmenu in the "Top" node's menu.
# texinfo_no_detailmenu = False
docs/source/extract.rst

Data extraction
===============
This section explains how to write the specification for extracting data
from a document. We'll scrape the following HTML content for the movie
"The Shining" in our examples:
.. literalinclude:: ../../examples/shining.html
   :language: html
Instead of the :func:`scrape_document <piculet.scrape_document>` function
that reads the content and the specification from files, we'll use
the :func:`scrape <piculet.scrape>` function that works directly on the content
and the specification map:
.. code-block:: python

   >>> from piculet import scrape
Assuming the HTML document above is saved as :file:`shining.html`, let's get
its content:
.. code-block:: python

   >>> with open("shining.html") as f:
   ...     document = f.read()
The :func:`scrape <piculet.scrape>` function assumes that the document
is in XML format. So if any conversion is needed, it has to be done
before calling this function. [#xhtml]_ After building the DOM tree,
the function will apply the extraction rules to the root element of the tree
and return a mapping where each item is generated by one of the rules.
.. note::

   Piculet uses the `ElementTree`_ module for building and querying
   XML trees. However, it will make use of the `lxml`_ package if it's
   installed. The :func:`scrape <piculet.scrape>` function takes
   an optional ``lxml_html`` parameter which will use the HTML builder
   from the lxml package, thereby building the tree without converting
   HTML into XML first.
The specification mapping contains two keys: the ``pre`` key is for specifying
the preprocessing operations (these will be covered in the next section),
and the ``items`` key is for specifying the rules that describe how to extract
the data:
.. code-block:: python

   spec = {"pre": [...], "items": [...]}
The items list contains item mappings, where each item has a ``key`` and
a ``value`` description. The key specifies the key for the item in the output
mapping, and the value specifies how to extract the data to set as the value
of that item. Typically, a value specifier consists of a path query and
a reducing function. The query is applied to the root to obtain a list
of strings, and the reducing function then converts this list into a single
string. [#reducing]_
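As a rough illustration of the path-then-reduce idea, here is a stdlib-only sketch using ElementTree directly; this is a conceptual analogy, not Piculet's actual code:

```python
# Conceptual sketch of "apply a path query, then reduce" -- not piculet's code.
from xml.etree import ElementTree as ET

document = "<movie><title>The Shining</title></movie>"
root = ET.fromstring(document)

# The path query yields a list of strings ...
texts = [el.text for el in root.findall(".//title")]

# ... and the reducing function collapses that list into a single string.
first = lambda xs: xs[0]
print(first(texts))  # -> The Shining
```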
For example, to get the title of the movie from the example document,
we can write:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
The ``//title/text()`` path generates the list ``['The Shining']``,
and the reducing function ``first`` selects the first element of that list.
.. note::

   By default, the XPath queries are limited by `what ElementTree supports`_
   (plus the ``text()`` and ``@attr`` clauses which are added by Piculet).
   However, if the `lxml`_ package is installed, a
   `much wider range of XPath constructs`_ can be used.
Multiple items can be collected in a single invocation:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... },
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining', 'year': '1980'}
If a path doesn't match any element in the tree, the item will be excluded
from the output. Note that in the following example, the "foo" key doesn't
get included:
>>> spec = {
... "items": [
... {
... "key": "title",
... "value": {
... "path": "//title/text()",
... "reduce": "first"
... }
... },
... {
... "key": "foo",
... "value": {
... "path": "//foo/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'title': 'The Shining'}
Reducing
--------
Piculet contains a few predefined reducing functions. Other than the ``first``
reducer used in the examples above, a very common reducer is ``concat``
which will concatenate the selected strings:
>>> spec = {
... "items": [
... {
... "key": "full_title",
... "value": {
... "path": "//h1//text()",
... "reduce": "concat"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
``concat`` is the default reducer, i.e. if no reducer is given, the strings
will be concatenated:
>>> spec = {
... "items": [
... {
... "key": "full_title",
... "value": {
... "path": "//h1//text()"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'full_title': 'The Shining (1980)'}
If you want to get rid of extra whitespace, you can use the ``clean`` reducer.
After concatenating the strings, this will remove leading and trailing
whitespace and replace multiple whitespace with a single space:
>>> spec = {
... "items": [
... {
... "key": "review",
... "value": {
... "path": '//div[@class="review"]//text()',
... "reduce": "clean"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'review': 'Fantastic movie. Definitely recommended.'}
In this example, the ``concat`` reducer would have produced the value
``'\n Fantastic movie.\n Definitely recommended.\n '``.
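For illustration, the cleaning behavior can be approximated in plain Python; this is only a sketch of the described semantics, not Piculet's actual implementation:

```python
import re

def clean(texts):
    """Concatenate, then collapse runs of whitespace and strip the ends."""
    return re.sub(r"\s+", " ", "".join(texts)).strip()

print(clean(["\n  Fantastic movie.\n  ", "Definitely recommended.\n"]))
# -> Fantastic movie. Definitely recommended.
```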
As explained above, if a path query doesn't match any element, the item
gets automatically excluded. That means Piculet doesn't try to apply
the reducing function when the result of the path query is an empty list.
Therefore, reducing functions can safely assume that the path result is
a non-empty list.
If you want to use a custom reducer, you have to register it first. The name
used in the value specifier (the first parameter) has to be a valid Python
identifier.

.. code-block:: python

   >>> from piculet import reducers
   >>> reducers.register("second", lambda x: x[1])
>>> spec = {
... "items": [
... {
... "key": "year",
... "value": {
... "path": "//h1//text()",
... "reduce": "second"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'year': '1980'}
Transforming
------------
After the reduction operation, you can apply a transformation
to the resulting string. A transformation function must take a string
as its parameter and can return a value of any type. Piculet contains several
predefined transformers: ``int``, ``float``, ``bool``, ``len``, ``lower``,
``upper``, ``capitalize``. For example, to get the year of the movie
as an integer:
>>> spec = {
... "items": [
... {
... "key": "year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first",
... "transform": "int"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'year': 1980}
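The order of operations, reduce first and transform second, can be sketched as follows; the helper name ``apply_value`` is hypothetical and only illustrates the described pipeline, not Piculet's internals:

```python
def apply_value(texts, reduce, transform=None):
    """Reduce the matched strings, then optionally transform the result."""
    value = reduce(texts)
    return transform(value) if transform is not None else value

first = lambda xs: xs[0]
print(apply_value(["1980"], first, int))  # -> 1980
```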
If you want to use a custom transformer, you have to register it first:
.. code-block:: python

   >>> from piculet import transformers
   >>> transformers.register("year25", lambda x: int(x) + 25)
>>> spec = {
... "items": [
... {
... "key": "25th_year",
... "value": {
... "path": '//span[@class="year"]/text()',
... "reduce": "first",
... "transform": "year25"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'25th_year': 2005}
Multi-valued items
------------------
Data with multiple values can be created by using a ``foreach`` key
in the value specifier. This is a path expression to select elements
from the tree. [#multivalued]_ The path and reducing function will be applied
*to each selected element* and the obtained values will be the members
of the resulting list. For example, to get the genres of the movie,
we can write:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['Horror', 'Drama']}
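Conceptually, the ``foreach`` query first selects elements, and the path/reduce pair then runs once per selected element, producing one value each. A stdlib-only sketch of the same idea (not Piculet's code):

```python
from xml.etree import ElementTree as ET

fragment = '<ul class="genres"><li>Horror</li><li>Drama</li></ul>'
root = ET.fromstring(fragment)

# One value per selected element, instead of one value overall.
genres = [li.text for li in root.findall("./li")]
print(genres)  # -> ['Horror', 'Drama']
```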
If the ``foreach`` key doesn't match any element the item will be excluded
from the result:
>>> spec = {
... "items": [
... {
... "key": "foos",
... "value": {
... "foreach": '//ul[@class="foos"]/li',
... "path": "./text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{}
If a transformation is specified, it will be applied to every element
in the resulting list:
>>> spec = {
... "items": [
... {
... "key": "genres",
... "value": {
... "foreach": '//ul[@class="genres"]/li',
... "path": "./text()",
... "reduce": "first",
... "transform": "lower"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'genres': ['horror', 'drama']}
Subrules
--------
Nested structures can be created by writing subrules as value specifiers.
If the value specifier is a mapping that contains an ``items`` key,
then this will be interpreted as a subrule and the generated mapping
will be the value for the key.
>>> spec = {
... "items": [
... {
... "key": "director",
... "value": {
... "items": [
... {
... "key": "name",
... "value": {
... "path": '//div[@class="director"]//a/text()',
... "reduce": "first"
... }
... },
... {
... "key": "link",
... "value": {
... "path": '//div[@class="director"]//a/@href',
... "reduce": "first"
... }
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
Subrules can be combined with lists:
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": {
... "path": "./td[1]/a/text()",
... "reduce": "first"
... }
... },
... {
... "key": "link",
... "value": {
... "path": "./td[1]/a/@href",
... "reduce": "first"
... }
... },
... {
... "key": "character",
... "value": {
... "path": "./td[2]/text()",
... "reduce": "first"
... }
... }
... ]
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': [{'character': 'Jack Torrance',
           'link': '/people/2',
           'name': 'Jack Nicholson'},
          {'character': 'Wendy Torrance',
           'link': '/people/3',
           'name': 'Shelley Duvall'}]}
Items generated by subrules can also be transformed. The transformation
function is always applied as the last step in a "value" definition. But
transformers for subitems take mappings (as opposed to strings) as parameter.
>>> transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
>>> spec = {
... "items": [
... {
... "key": "cast",
... "value": {
... "foreach": '//table[@class="cast"]/tr',
... "items": [
... {
... "key": "name",
... "value": {
... "path": "./td[1]/a/text()",
... "reduce": "first"
... }
... },
... {
... "key": "character",
... "value": {
... "path": "./td[2]/text()",
... "reduce": "first"
... }
... }
... ],
... "transform": "stars"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'cast': ['Jack Nicholson as Jack Torrance',
          'Shelley Duvall as Wendy Torrance']}
Generating keys from content
----------------------------
You can generate items where the key also comes from the content.
For example, consider how you would get the runtime and the language
of the movie. Instead of writing a separate item for each ``h3`` element
under a ``div`` with the class "info", we can write a single item that selects
these divs and uses the ``h3`` text as the key. The elements are selected
using a ``foreach`` specification in the item, which causes a new item
to be generated for each selected element. To produce the key, we can use
paths, reducers, and also transformers, which will be applied
to the selected element:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()",
... "reduce": "first"
... },
... "value": {
... "path": "./p/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'Language:': 'English', 'Runtime:': '144 minutes'}
The ``normalize`` reducer concatenates the strings, converts the result
to lowercase, replaces spaces with underscores, and strips other
non-alphanumeric characters:
>>> spec = {
... "items": [
... {
... "foreach": '//div[@class="info"]',
... "key": {
... "path": "./h3/text()",
... "reduce": "normalize"
... },
... "value": {
... "path": "./p/text()",
... "reduce": "first"
... }
... }
... ]
... }
>>> scrape(document, spec)
{'language': 'English', 'runtime': '144 minutes'}
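The described normalization can be approximated in plain Python; this is an illustrative sketch only, and the actual implementation may differ in details:

```python
import re

def normalize(texts):
    """Concatenate, lowercase, spaces to underscores, drop other punctuation."""
    joined = "".join(texts).lower().replace(" ", "_")
    return re.sub(r"[^a-z0-9_]", "", joined)

print(normalize(["Runtime:"]))  # -> runtime
```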
You could also give a plain string as the key instead of a path and reducer.
In that case, the elements would still be traversed, but only the last one
would set the final value for the item. This is acceptable only if you are
sure that exactly one element matches the ``foreach`` path.
Sections
--------
The specification also provides the ability to define sections within
the document. An element can be selected as the root of a section such that
the XPath queries in that section will be relative to that root. This can be
used to make XPath expressions shorter and also constrain the search
in the tree. For example, the "director" example above can also be written
using sections:
.. code-block:: python

   >>> spec = {
   ...     "section": '//div[@class="director"]//a',
   ...     "items": [
   ...         {
   ...             "key": "director",
   ...             "value": {
   ...                 "items": [
   ...                     {
   ...                         "key": "name",
   ...                         "value": {
   ...                             "path": "./text()",
   ...                             "reduce": "first"
   ...                         }
   ...                     },
   ...                     {
   ...                         "key": "link",
   ...                         "value": {
   ...                             "path": "./@href",
   ...                             "reduce": "first"
   ...                         }
   ...                     }
   ...                 ]
   ...             }
   ...         }
   ...     ]
   ... }
   >>> scrape(document, spec)
   {'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
.. [#xhtml]
   Note that the example document is already in XML format.

.. [#reducing]
   This means that the query has to end with either ``text()`` or some
   attribute value as in ``@attr``. And the reducing function should be
   implemented so that it takes a list of strings and returns a string.

.. [#multivalued]
   This implies that the ``foreach`` query should **not** end in ``text()``
   or ``@attr``.
.. _ElementTree: https://docs.python.org/3/library/xml.etree.elementtree.html
.. _what ElementTree supports: https://docs.python.org/3/library/xml.etree.elementtree.html#xpath-support
.. _lxml: http://lxml.de/
.. _much wider range of XPath constructs: http://lxml.de/xpathxslt.html#xpath
docs/source/history.rst

.. include:: ../../HISTORY.rst
docs/source/index.rst

Piculet
=======
.. include:: ../../README.rst
Contents
========

.. toctree::
   :maxdepth: 2

   overview
   extract
   preprocess
   low-level
   api
   history
Indices and Tables
==================
* :ref:`genindex`
* :ref:`search`
docs/source/low-level.rst
Lower-level functions
=====================
Piculet also provides a lower-level API where you can run the stages
separately. For example, if the same document will be scraped multiple times
with different rules, calling the ``scrape`` function repeatedly will cause
the document to be parsed into a DOM tree each time. Instead, you can
create the DOM tree once and run extraction rules against this tree
multiple times.

Also, this API uses classes to express the specification, so development
tools can better assist in writing the rules by showing error indicators
and suggesting completions.
Building the tree
-----------------
The DOM tree can be created from the document using
the :func:`build_tree <piculet.build_tree>` function:
.. code-block:: python

   >>> from piculet import build_tree
   >>> root = build_tree(document)
If the document needs to be converted from HTML to XML, you can use
the :func:`html_to_xhtml <piculet.html_to_xhtml>` function:

.. code-block:: python

   >>> from piculet import html_to_xhtml
   >>> converted = html_to_xhtml(document)
   >>> root = build_tree(converted)
If lxml is available, you can use the ``lxml_html`` parameter for building
the tree without converting an HTML document into XHTML:
.. code-block:: python

   >>> root = build_tree(document, lxml_html=True)
.. note::

   If you use the lxml.html builder, the tree may be built differently
   compared to Piculet's own conversion method, and the path queries
   used for preprocessing and extraction might need changes.
Preprocessing
-------------
The tree can be modified using the :func:`preprocess <piculet.preprocess>`
function:

.. code-block:: python

   >>> from piculet import preprocess
   >>> ops = [{"op": "remove", "path": '//div[@class="ad"]'}]
   >>> preprocess(root, ops)
Data extraction
---------------
The class-based API for data extraction has a one-to-one correspondence
with the specification mapping. A :class:`Rule <piculet.Rule>` object
corresponds to a key-value pair in the items list. Its value is produced
by an ``extractor``. In the simple case, an extractor is
a :class:`Path <piculet.Path>` object which is a combination of a path,
a reducer, and a transformer.
.. code-block:: python

   >>> from piculet import Path, Rule, reducers, transformers
   >>> extractor = Path('//span[@class="year"]/text()',
   ...                  reduce=reducers.first,
   ...                  transform=transformers.int)
   >>> rule = Rule(key="year", extractor=extractor)
   >>> rule.extract(root)
   {'year': 1980}
An extractor can have a ``foreach`` attribute if it will be multi-valued:
.. code-block:: python

   >>> extractor = Path(foreach='//ul[@class="genres"]/li',
   ...                  path="./text()",
   ...                  reduce=reducers.first,
   ...                  transform=transformers.lower)
   >>> rule = Rule(key="genres", extractor=extractor)
   >>> rule.extract(root)
   {'genres': ['horror', 'drama']}
The ``key`` attribute of a rule can itself be an extractor, in which case
it is used to extract the key value from the content. A rule can also have
a ``foreach`` attribute for generating multiple items from one rule.
These features work as described in the data extraction chapter.
A :class:`Rules <piculet.Rules>` object contains a collection of rule objects
and corresponds to the "items" part of the specification mapping. It acts
both as the top-level extractor that gets applied to the root of the tree,
and as the extractor for any rule with subrules.
.. code-block:: python

   >>> from piculet import Rules
   >>> rules = [Rule(key="title",
   ...               extractor=Path("//title/text()")),
   ...          Rule(key="year",
   ...               extractor=Path('//span[@class="year"]/text()',
   ...                              transform=transformers.int))]
   >>> Rules(rules).extract(root)
   {'title': 'The Shining', 'year': 1980}
A more complete example with transformations is given below. Again, note that
the specification is exactly the same as in the corresponding
mapping example in the data extraction chapter.
.. code-block:: python

   >>> rules = [
   ...     Rule(key="cast",
   ...          extractor=Rules(
   ...              foreach='//table[@class="cast"]/tr',
   ...              rules=[
   ...                  Rule(key="name",
   ...                       extractor=Path("./td[1]/a/text()")),
   ...                  Rule(key="character",
   ...                       extractor=Path("./td[2]/text()"))
   ...              ],
   ...              transform=lambda x: "%(name)s as %(character)s" % x
   ...          ))
   ... ]
   >>> Rules(rules).extract(root)
   {'cast': ['Jack Nicholson as Jack Torrance',
             'Shelley Duvall as Wendy Torrance']}
A rules object can have a ``section`` attribute as described in the data
extraction chapter:
.. code-block:: python

   >>> rules = [
   ...     Rule(key="director",
   ...          extractor=Rules(
   ...              section='//div[@class="director"]//a',
   ...              rules=[
   ...                  Rule(key="name",
   ...                       extractor=Path("./text()")),
   ...                  Rule(key="link",
   ...                       extractor=Path("./@href"))
   ...              ]))
   ... ]
   >>> Rules(rules).extract(root)
   {'director': {'link': '/people/1', 'name': 'Stanley Kubrick'}}
docs/source/overview.rst

Overview
========
Scraping a document consists of three stages:
#. Building a DOM tree out of the document. This is a straightforward
   operation for an XML document. For an HTML document, Piculet will first
   try to convert it into XHTML and then build the tree from that.

#. Preprocessing the tree. This is an optional stage. In some cases
   it might be helpful to make some changes on the tree to simplify
   the extraction process.

#. Extracting data out of the tree.
The preprocessing and extraction stages are expressed as part of a scraping
specification. The specification is a mapping which can be stored
in a file format that can represent a mapping, such as JSON or YAML.
Details about the specification are given in later chapters.
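As an illustration of storing the mapping in a file format, a minimal specification could be kept in JSON and loaded with the standard library; the spec below is a simplified sketch:

```python
import json

# A minimal specification as it might appear in a JSON file.
spec_text = """
{
  "items": [
    {"key": "title",
     "value": {"path": "//title/text()", "reduce": "first"}}
  ]
}
"""
spec = json.loads(spec_text)
print(spec["items"][0]["key"])  # -> title
```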
Command Line Interface
----------------------
Installing Piculet creates a script named ``piculet`` which can be used
to invoke the command line interface::
   $ piculet -h
   usage: piculet [-h] [--debug] command ...
The ``scrape`` command extracts data out of a document as described by
a specification file::
   $ piculet scrape -h
   usage: piculet scrape [-h] -s SPEC [--html] document
The location of the document can be given as a file path or a URL.
For example, say you want to extract some data from the file `shining.html`_.
An example specification is given in `movie.json`_.
Download both of these files and run the command::
   $ piculet scrape -s movie.json shining.html
This should print the following output::
   {
     "cast": [
       {
         "character": "Jack Torrance",
         "link": "/people/2",
         "name": "Jack Nicholson"
       },
       {
         "character": "Wendy Torrance",
         "link": "/people/3",
         "name": "Shelley Duvall"
       }
     ],
     "director": {
       "link": "/people/1",
       "name": "Stanley Kubrick"
     },
     "genres": [
       "Horror",
       "Drama"
     ],
     "language": "English",
     "review": "Fantastic movie. Definitely recommended.",
     "runtime": "144 minutes",
     "title": "The Shining",
     "year": 1980
   }
For HTML documents, the ``--html`` option has to be used. If the document
address starts with ``http://`` or ``https://``, the content will be taken
from the given URL. For example, to extract some data from the Wikipedia page
for `David Bowie`_, download the `wikipedia.json`_ file and run the command::
$ piculet scrape -s wikipedia.json --html "https://en.wikipedia.org/wiki/David_Bowie"
This should print the following output::
{
"birthplace": "Brixton, London, England",
"born": "1947-01-08",
"name": "David Bowie",
"occupation": [
"Singer",
"songwriter",
"actor"
]
}
In the same command, change the name part of the URL to ``Merlene_Ottey`` and
you will get similar data for `Merlene Ottey`_. Note that since the markup
used in Wikipedia pages for persons varies, the kinds of data you get
with this specification will also vary.
Piculet can be used as a simplistic HTML to XHTML convertor by invoking it with
the ``h2x`` command. This command takes the file name as input and prints
the converted content, as in ``piculet h2x foo.html``. If the input file name
is given as ``-`` it will read the content from the standard input
and therefore can be used as part of a pipe:
``cat foo.html | piculet h2x -``
Using in programs
-----------------
The scraping operation can also be invoked programmatically using
the :func:`scrape_document <piculet.scrape_document>` function. Note that
this function prints its output and doesn't return anything:
.. code-block:: python
from piculet import scrape_document
url = "https://en.wikipedia.org/wiki/David_Bowie"
spec = "wikipedia.json"
scrape_document(url, spec, content_format="html")
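Since the function prints its output, a program that needs the data itself can capture standard output and parse the printed JSON. A stdlib-only sketch, using a hypothetical stand-in for the printing function:

```python
import json
from contextlib import redirect_stdout
from io import StringIO

def print_json():
    # hypothetical stand-in for a function that prints its result as JSON
    print(json.dumps({"name": "David Bowie"}, indent=2, sort_keys=True))

# capture everything written to stdout, then parse it back into a mapping
buffer = StringIO()
with redirect_stdout(buffer):
    print_json()
data = json.loads(buffer.getvalue())
```

Alternatively, the lower-level ``scrape`` function returns the extracted data directly instead of printing it.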
YAML support
------------
To write specifications in YAML, Piculet has to be installed with YAML support::
pip install piculet[yaml]
Note that this will install an external module for parsing YAML files,
so Piculet will no longer depend only on the standard library.
The YAML version of the specification example above can be found in
`movie.yaml`_.
.. _shining.html: https://github.com/uyar/piculet/blob/master/examples/shining.html
.. _movie.json: https://github.com/uyar/piculet/blob/master/examples/movie.json
.. _movie.yaml: https://github.com/uyar/piculet/blob/master/examples/movie.yaml
.. _wikipedia.json: https://github.com/uyar/piculet/blob/master/examples/wikipedia.json
.. _David Bowie: https://en.wikipedia.org/wiki/David_Bowie
.. _Merlene Ottey: https://en.wikipedia.org/wiki/Merlene_Ottey
docs/source/preprocess.rst

Preprocessing
=============
In addition to extraction rules, specifications can contain preprocessing
operations that modify the tree before data extraction starts.
Such operations can make data extraction simpler, or remove the need
for postprocessing the collected data.
The syntax for writing preprocessing operations is as follows:
.. code-block:: python
rules = {
"pre": [
{
"op": "...",
...
},
{
"op": "...",
...
}
],
"items": [ ... ]
}
Every preprocessing operation has a name, given as the value of the "op" key;
the other keys in the mapping are specific to that operation.
The operations are applied in the order they appear in the list.
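For instance, a hypothetical operations list (the paths and values here are made up for illustration) that first strips ``script`` elements and then tags the first table:

```python
# a made-up preprocessing list: the steps run top to bottom
pre = [
    {"op": "remove", "path": "//script"},
    {"op": "set_attr", "path": "//table[1]", "name": "id", "value": "info"},
]

# the "remove" step is applied first, then "set_attr"
ops = [step["op"] for step in pre]
```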
The predefined preprocessing operations are explained below.
Removing elements
-----------------
This operation removes from the tree all elements (along with their subtrees)
that are selected by a given XPath query:
.. code-block:: python
{"op": "remove", "path": "..."}
Setting element attributes
--------------------------
This operation selects all elements by a given XPath query and
sets an attribute for these elements to a given value:
.. code-block:: python
{"op": "set_attr", "path": "...", "name": "...", "value": "..."}
The attribute "name" can be a literal string or an extractor as described
in the data extraction chapter. Similarly, the attribute "value" can be given
as a literal string or an extractor.
Setting element text
--------------------
This operation selects all elements by a given XPath query and
sets their texts to a given value:
.. code-block:: python
{"op": "set_text", "path": "...", "text": "..."}
The "text" can be a literal string or an extractor.
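The effect of these three operations can be mirrored with the standard library alone; a sketch, independent of Piculet, showing each on a small tree:

```python
from xml.etree import ElementTree

root = ElementTree.fromstring(
    "<movie><ad>buy now</ad><title>The Shining</title></movie>"
)

# "remove": delete selected elements together with their subtrees
# (ElementTree needs the parent element to perform the removal)
for ad in root.findall("ad"):
    root.remove(ad)

# "set_attr": set an attribute on selected elements to a given value
for title in root.findall("title"):
    title.set("lang", "en")

# "set_text": overwrite the text of selected elements
for title in root.findall("title"):
    title.text = "THE SHINING"
```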
piculet.py

# Copyright (C) 2014-2019 H. Turgut Uyar
#
# Piculet is free software: you can redistribute it and/or modify
# it under the terms of the GNU Lesser General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# Piculet is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU Lesser General Public License for more details.
#
# You should have received a copy of the GNU Lesser General Public License
# along with Piculet. If not, see <http://www.gnu.org/licenses/>.
"""Piculet is a module for scraping XML and HTML documents using XPath queries.
It consists of this single source file with no dependencies other than
the standard library, which makes it very easy to integrate into applications.
For more information, please refer to the documentation:
https://piculet.readthedocs.io/
"""
from __future__ import absolute_import, division, print_function, unicode_literals
import json
import logging
import os
import re
import sys
from argparse import ArgumentParser
from collections import deque
from functools import partial
from operator import itemgetter
from pkgutil import find_loader
__version__ = "1.0.1"
PY2 = sys.version_info < (3, 0)
if PY2:
str, bytes = unicode, str
if PY2:
from cgi import escape as html_escape
from HTMLParser import HTMLParser
from StringIO import StringIO
from htmlentitydefs import name2codepoint
from urllib2 import urlopen
else:
from html import escape as html_escape
from html.parser import HTMLParser
from io import StringIO
from urllib.request import urlopen
if PY2:
from contextlib import contextmanager
@contextmanager
def redirect_stdout(new_stdout):
"""Context manager for temporarily redirecting stdout."""
old_stdout, sys.stdout = sys.stdout, new_stdout
try:
yield new_stdout
finally:
sys.stdout = old_stdout
else:
from contextlib import redirect_stdout
_logger = logging.getLogger(__name__)
###########################################################
# HTML OPERATIONS
###########################################################
# TODO: this is too fragile
_CHARSET_TAGS = [
b'<meta charset="',
b'<meta http-equiv="content-type" content="text/html; charset=',
]
def decode_html(content, charset=None, fallback_charset="utf-8"):
"""Decode the content of an HTML document.
:sig: (bytes, Optional[str], Optional[str]) -> str
:param content: Content of HTML document to decode.
:param charset: Character set of the page.
:param fallback_charset: Character set to use if it can't be figured out.
:return: Decoded content of the document.
"""
if charset is None:
for tag in _CHARSET_TAGS:
start = content.find(tag)
if start >= 0:
charset_start = start + len(tag)
charset_end = content.find(b'"', charset_start)
charset = content[charset_start:charset_end].decode("ascii")
_logger.debug("charset found in <meta>: %s", charset)
break
else:
_logger.debug("charset not found, using fallback: %s", fallback_charset)
charset = fallback_charset
_logger.debug("decoding for charset: %s", charset)
return content.decode(charset)
class HTMLNormalizer(HTMLParser):
"""HTML cleaner and XHTML convertor.
DOCTYPE declarations and comments are removed.
"""
VOID_ELEMENTS = frozenset(
{
"area",
"base",
"basefont",
"bgsound",
"br",
"col",
"command",
"embed",
"frame",
"hr",
"image",
"img",
"input",
"isindex",
"keygen",
"link",
"menuitem",
"meta",
"nextid",
"param",
"source",
"track",
"wbr",
}
)
"""Tags to handle as self-closing."""
def __init__(self, omit_tags=None, omit_attrs=None):
"""Initialize this normalizer.
:sig: (Optional[Iterable[str]], Optional[Iterable[str]]) -> None
:param omit_tags: Tags to remove, along with all their content.
:param omit_attrs: Attributes to remove.
"""
if PY2:
HTMLParser.__init__(self)
else:
super().__init__(convert_charrefs=True)
self.omit_tags = set(omit_tags) if omit_tags is not None else set() # sig: Set[str]
self.omit_attrs = set(omit_attrs) if omit_attrs is not None else set() # sig: Set[str]
# stacks used during normalization
self._open_tags = deque()
self._open_omitted_tags = deque()
def handle_starttag(self, tag, attrs):
"""Process the starting of a new element."""
if tag in self.omit_tags:
_logger.debug("omitting starting tag: <%s>", tag)
self._open_omitted_tags.append(tag)
if not self._open_omitted_tags:
# stack empty -> not in omit mode
if "@" in tag:
# email address in angular brackets
print("<%s>" % tag, end="")
return
if (tag == "li") and (self._open_tags[-1] == "li"):
_logger.debug("opened <li> without closing previous <li>, adding </li>")
self.handle_endtag("li")
attributes = []
for attr_name, attr_value in attrs:
if attr_name in self.omit_attrs:
_logger.debug("omitting attribute of <%s>: %s", tag, attr_name)
continue
if attr_value is None:
_logger.debug(
"adding empty value for attribute of <%s>: %s", tag, attr_name
)
attr_value = ""
markup = '%(name)s="%(value)s"' % {
"name": attr_name,
"value": html_escape(attr_value, quote=True),
}
attributes.append(markup)
line = "<%(tag)s%(attrs)s%(slash)s>" % {
"tag": tag,
"attrs": (" " + " ".join(attributes)) if len(attributes) > 0 else "",
"slash": " /" if tag in self.VOID_ELEMENTS else "",
}
print(line, end="")
if tag not in self.VOID_ELEMENTS:
self._open_tags.append(tag)
def handle_endtag(self, tag):
"""Process the ending of an element."""
if not self._open_omitted_tags:
# stack empty -> not in omit mode
if tag not in self.VOID_ELEMENTS:
last = self._open_tags[-1]
if (tag == "ul") and (last == "li"):
_logger.debug("closing <ul> without closing last <li>, adding </li>")
self.handle_endtag("li")
if tag == last:
# expected end tag
print("</%(tag)s>" % {"tag": tag}, end="")
self._open_tags.pop()
elif tag not in self._open_tags:
_logger.debug("closing tag without opening tag: <%s>", tag)
# XXX: for improperly nested tags, this case gets invoked after the case below
elif tag == self._open_tags[-2]:
_logger.debug(
"unexpected closing tag <%s> instead of <%s>, closing both", tag, last
)
print("</%(tag)s>" % {"tag": last}, end="")
print("</%(tag)s>" % {"tag": tag}, end="")
self._open_tags.pop()
self._open_tags.pop()
elif (tag in self.omit_tags) and (tag == self._open_omitted_tags[-1]):
# end of expected omitted tag
self._open_omitted_tags.pop()
def handle_data(self, data):
"""Process collected character data."""
if not self._open_omitted_tags:
# stack empty -> not in omit mode
line = html_escape(data)
print(line.decode("utf-8") if PY2 and isinstance(line, bytes) else line, end="")
def handle_entityref(self, name):
"""Process an entity reference."""
# XXX: doesn't get called if convert_charrefs=True
num = name2codepoint.get(name) # we are sure we're on PY2 here
if num is not None:
print("&#%(ref)d;" % {"ref": num}, end="")
def handle_charref(self, name):
"""Process a character reference."""
# XXX: doesn't get called if convert_charrefs=True
print("&#%(ref)s;" % {"ref": name}, end="")
# def feed(self, data):
# super().feed(data)
# # close all remaining open tags
# for tag in reversed(self._open_tags):
# print('</%(tag)s>' % {'tag': tag}, end='')
def html_to_xhtml(document, omit_tags=None, omit_attrs=None):
"""Clean HTML and convert to XHTML.
:sig: (str, Optional[Iterable[str]], Optional[Iterable[str]]) -> str
:param document: HTML document to clean and convert.
:param omit_tags: Tags to exclude from the output.
:param omit_attrs: Attributes to exclude from the output.
:return: Normalized XHTML content.
"""
out = StringIO()
normalizer = HTMLNormalizer(omit_tags=omit_tags, omit_attrs=omit_attrs)
with redirect_stdout(out):
normalizer.feed(document)
return out.getvalue()
###########################################################
# DATA EXTRACTION OPERATIONS
###########################################################
# sigalias: XPathResult = Union[Sequence[str], Sequence[Element]]
_USE_LXML = find_loader("lxml") is not None
if _USE_LXML:
_logger.info("using lxml")
from lxml import etree as ElementTree
from lxml.etree import Element
XPath = ElementTree.XPath
xpath = ElementTree._Element.xpath
else:
from xml.etree import ElementTree
from xml.etree.ElementTree import Element
class XPath:
"""An XPath expression evaluator.
This class is mainly needed to compensate for the lack of ``text()``
and ``@attr`` axis queries in ElementTree XPath support.
"""
def __init__(self, path):
"""Initialize this evaluator.
:sig: (str) -> None
:param path: XPath expression to evaluate.
"""
if path[0] == "/":
# ElementTree doesn't support absolute paths
# TODO: handle this properly, find root of tree
path = "." + path
def descendant(element):
# strip trailing '//text()'
return [t for e in element.findall(path[:-8]) for t in e.itertext() if t]
def child(element):
# strip trailing '/text()'
return [
t
for e in element.findall(path[:-7])
for t in ([e.text] + [c.tail if c.tail else "" for c in e])
if t
]
def attribute(element, subpath, attr):
result = [e.attrib.get(attr) for e in element.findall(subpath)]
return [r for r in result if r is not None]
if path.endswith("//text()"):
_apply = descendant
elif path.endswith("/text()"):
_apply = child
else:
steps = path.split("/")
front, last = steps[:-1], steps[-1]
# after dropping PY2: *front, last = path.split('/')
if last.startswith("@"):
_apply = partial(attribute, subpath="/".join(front), attr=last[1:])
else:
_apply = partial(Element.findall, path=path)
self._apply = _apply # sig: Callable[[Element], XPathResult]
def __call__(self, element):
"""Apply this evaluator to an element.
:sig: (Element) -> XPathResult
:param element: Element to apply this expression to.
:return: Elements or strings resulting from the query.
"""
return self._apply(element)
xpath = lambda e, p: XPath(p)(e)
_EMPTY = {} # sig: Dict
# sigalias: Reducer = Callable[[Sequence[str]], str]
# sigalias: PathTransformer = Callable[[str], Any]
# sigalias: MapTransformer = Callable[[Mapping[str, Any]], Any]
# sigalias: Transformer = Union[PathTransformer, MapTransformer]
# sigalias: ExtractedItem = Union[str, Mapping[str, Any]]
class Extractor:
"""Abstract base extractor for getting data out of an XML element."""
def __init__(self, transform=None, foreach=None):
"""Initialize this extractor.
:sig: (Optional[Transformer], Optional[str]) -> None
:param transform: Function to transform the extracted value.
:param foreach: Path to apply for generating a collection of values.
"""
self.transform = transform # sig: Optional[Transformer]
"""Function to transform the extracted value."""
self.foreach = XPath(foreach) if foreach is not None else None # sig: Optional[XPath]
"""Path to apply for generating a collection of values."""
def apply(self, element):
"""Get the raw data from an element using this extractor.
:sig: (Element) -> ExtractedItem
:param element: Element to apply this extractor to.
:return: Extracted raw data.
"""
raise NotImplementedError("Concrete extractors must implement this method")
def extract(self, element, transform=True):
"""Get the processed data from an element using this extractor.
:sig: (Element, Optional[bool]) -> Any
:param element: Element to extract the data from.
:param transform: Whether the transformation will be applied or not.
:return: Extracted and optionally transformed data.
"""
value = self.apply(element)
if (value is None) or (value is _EMPTY) or (not transform):
return value
return value if self.transform is None else self.transform(value)
@staticmethod
def from_map(item):
"""Generate an extractor from a description map.
:sig: (Mapping[str, Any]) -> Extractor
:param item: Extractor description.
:return: Extractor object.
:raise ValueError: When reducer or transformer names are unknown.
"""
transformer = item.get("transform")
if transformer is None:
transform = None
else:
transform = transformers.get(transformer)
if transform is None:
raise ValueError("Unknown transformer")
foreach = item.get("foreach")
path = item.get("path")
if path is not None:
reducer = item.get("reduce")
if reducer is None:
reduce = None
else:
reduce = reducers.get(reducer)
if reduce is None:
raise ValueError("Unknown reducer")
extractor = Path(path, reduce, transform=transform, foreach=foreach)
else:
items = item.get("items")
# TODO: check for None
rules = [Rule.from_map(i) for i in items]
extractor = Rules(
rules, section=item.get("section"), transform=transform, foreach=foreach
)
return extractor
class Path(Extractor):
"""An extractor for getting text out of an XML element."""
def __init__(self, path, reduce=None, transform=None, foreach=None):
"""Initialize this extractor.
:sig: (
str,
Optional[Reducer],
Optional[PathTransformer],
Optional[str]
) -> None
:param path: Path to apply to get the data.
:param reduce: Function to reduce selected texts into a single string.
:param transform: Function to transform extracted value.
:param foreach: Path to apply for generating a collection of data.
"""
if PY2:
Extractor.__init__(self, transform=transform, foreach=foreach)
else:
super().__init__(transform=transform, foreach=foreach)
self.path = XPath(path) # sig: XPath
"""XPath evaluator to apply to get the data."""
if reduce is None:
reduce = reducers.concat
self.reduce = reduce # sig: Reducer
"""Function to reduce selected texts into a single string."""
def apply(self, element):
"""Apply this extractor to an element.
:sig: (Element) -> str
:param element: Element to apply this extractor to.
:return: Extracted text.
"""
# _logger.debug("applying path on <%s>: %s", element.tag, self.path)
selected = self.path(element)
if len(selected) == 0:
# _logger.debug("no match")
value = None
else:
# _logger.debug("selected elements: %s", selected)
value = self.reduce(selected)
# _logger.debug("reduced using %s: %s", self.reduce, value)
return value
class Rules(Extractor):
"""An extractor for getting data items out of an XML element."""
def __init__(self, rules, section=None, transform=None, foreach=None):
"""Initialize this extractor.
:sig:
(
Sequence[Rule],
str,
Optional[MapTransformer],
Optional[str]
) -> None
:param rules: Rules for generating the data items.
:param section: Path for setting the root of this section.
:param transform: Function to transform extracted value.
:param foreach: Path for generating multiple items.
"""
if PY2:
Extractor.__init__(self, transform=transform, foreach=foreach)
else:
super().__init__(transform=transform, foreach=foreach)
self.rules = rules # sig: Sequence[Rule]
"""Rules for generating the data items."""
self.section = XPath(section) if section is not None else None # sig: Optional[XPath]
"""XPath expression for selecting a subroot for this section."""
def apply(self, element):
"""Apply this extractor to an element.
:sig: (Element) -> Mapping[str, Any]
:param element: Element to apply the extractor to.
:return: Extracted mapping.
"""
if self.section is None:
subroot = element
else:
subroots = self.section(element)
if len(subroots) == 0:
_logger.debug("no section root found")
return _EMPTY
if len(subroots) > 1:
raise ValueError("Section path should select exactly one element")
subroot = subroots[0]
_logger.debug("setting root: <%s>", subroot.tag)
data = {}
for rule in self.rules:
extracted = rule.extract(subroot)
data.update(extracted)
return data if len(data) > 0 else _EMPTY
class Rule:
"""A rule describing how to get a data item out of an XML element."""
def __init__(self, key, extractor, foreach=None):
"""Initialize this rule.
:sig: (Union[str, Extractor], Extractor, Optional[str]) -> None
:param key: Name to distinguish this data item.
:param extractor: Extractor that will generate this data item.
:param foreach: Path for generating multiple items.
"""
self.key = key # sig: Union[str, Extractor]
"""Name to distinguish this data item."""
self.extractor = extractor # sig: Extractor
"""Extractor that will generate this data item."""
self.foreach = XPath(foreach) if foreach is not None else None # sig: Optional[XPath]
"""XPath evaluator for generating multiple items."""
@staticmethod
def from_map(item):
"""Generate a rule from a description map.
:sig: (Mapping[str, Any]) -> Rule
:param item: Item description.
:return: Rule object.
"""
item_key = item["key"]
key = item_key if isinstance(item_key, str) else Extractor.from_map(item_key)
value = Extractor.from_map(item["value"])
return Rule(key=key, extractor=value, foreach=item.get("foreach"))
def extract(self, element):
"""Extract data out of an element using this rule.
:sig: (Element) -> Mapping[str, Any]
:param element: Element to extract the data from.
:return: Extracted data.
"""
data = {}
subroots = [element] if self.foreach is None else self.foreach(element)
for subroot in subroots:
# _logger.debug("setting section element: <%s>", subroot.tag)
key = self.key if isinstance(self.key, str) else self.key.extract(subroot)
if key is None:
# _logger.debug("no value generated for key name")
continue
# _logger.debug("extracting key: %s", key)
if self.extractor.foreach is None:
value = self.extractor.extract(subroot)
if (value is None) or (value is _EMPTY):
# _logger.debug("no value generated for key")
continue
data[key] = value
# _logger.debug("extracted value for %s: %s", key, data[key])
else:
# don't try to transform list items by default, it might waste a lot of time
raw_values = [
self.extractor.extract(r, transform=False)
for r in self.extractor.foreach(subroot)
]
values = [v for v in raw_values if (v is not None) and (v is not _EMPTY)]
if len(values) == 0:
# _logger.debug("no items found in list")
continue
data[key] = (
values
if self.extractor.transform is None
else list(map(self.extractor.transform, values))
)
# _logger.debug("extracted value for %s: %s", key, data[key])
return data
def remove_elements(root, path):
"""Remove selected elements from the tree.
:sig: (Element, str) -> None
:param root: Root element of the tree.
:param path: XPath to select the elements to remove.
"""
if _USE_LXML:
get_parent = ElementTree._Element.getparent
else:
# ElementTree doesn't support parent queries, so we'll build a map for it
get_parent = root.attrib.get("_get_parent")
if get_parent is None:
get_parent = {e: p for p in root.iter() for e in p}.get
root.attrib["_get_parent"] = get_parent
elements = XPath(path)(root)
_logger.debug("removing %s elements using path: %s", len(elements), path)
if len(elements) > 0:
for element in elements:
_logger.debug("removing element: <%s>", element.tag)
# XXX: could this be hazardous? parent removed in earlier iteration?
get_parent(element).remove(element)
def set_element_attr(root, path, name, value):
"""Set an attribute for selected elements.
:sig:
(
Element,
str,
Union[str, Mapping[str, Any]],
Union[str, Mapping[str, Any]]
) -> None
:param root: Root element of the tree.
:param path: XPath to select the elements to set attributes for.
:param name: Description for name generation.
:param value: Description for value generation.
"""
elements = XPath(path)(root)
_logger.debug("updating %s elements using path: %s", len(elements), path)
for element in elements:
attr_name = name if isinstance(name, str) else Extractor.from_map(name).extract(element)
if attr_name is None:
_logger.debug("no attribute name generated for <%s>:", element.tag)
continue
attr_value = (
value if isinstance(value, str) else Extractor.from_map(value).extract(element)
)
if attr_value is None:
_logger.debug("no attribute value generated for <%s>:", element.tag)
continue
_logger.debug("setting %s attribute of <%s>: %s", attr_name, element.tag, attr_value)
element.attrib[attr_name] = attr_value
def set_element_text(root, path, text):
"""Set the text for selected elements.
:sig: (Element, str, Union[str, Mapping[str, Any]]) -> None
:param root: Root element of the tree.
:param path: XPath to select the elements to set attributes for.
:param text: Description for text generation.
"""
elements = XPath(path)(root)
_logger.debug("updating %s elements using path: %s", len(elements), path)
for element in elements:
element_text = (
text if isinstance(text, str) else Extractor.from_map(text).extract(element)
)
# note that the text can be None in which case the existing text will be cleared
_logger.debug("setting text of <%s>: %s", element.tag, element_text)
element.text = element_text
def build_tree(document, lxml_html=False):
"""Build a tree from an XML document.
:sig: (str, Optional[bool]) -> Element
:param document: XML document to build the tree from.
:param lxml_html: Use the lxml.html builder if available.
:return: Root element of the XML tree.
"""
content = document.encode("utf-8") if PY2 else document
if _USE_LXML and lxml_html:
_logger.info("using lxml html builder")
import lxml.html
return lxml.html.fromstring(content)
return ElementTree.fromstring(content)
class Registry:
"""A simple, attribute-based namespace."""
def __init__(self, entries):
"""Initialize this registry.
:sig: (Mapping[str, Any]) -> None
:param entries: Entries to add to this registry.
"""
self.__dict__.update(entries)
def get(self, item):
"""Get the value of an entry from this registry.
:sig: (str) -> Any
:param item: Entry to get the value for.
:return: Value of entry.
"""
return self.__dict__.get(item)
def register(self, key, value):
"""Register a new entry in this registry.
:sig: (str, Any) -> None
:param key: Key to search the entry in this registry.
:param value: Value to store for the entry.
"""
self.__dict__[key] = value
_PREPROCESSORS = {
"remove": remove_elements,
"set_attr": set_element_attr,
"set_text": set_element_text,
}
preprocessors = Registry(_PREPROCESSORS) # sig: Registry
"""Predefined preprocessors."""
_REDUCERS = {
"first": itemgetter(0),
"concat": partial(str.join, ""),
"clean": lambda xs: re.sub(r"\s+", " ", "".join(xs).replace("\xa0", " ")).strip(),
"normalize": lambda xs: re.sub(r"[^a-z0-9_]", "", "".join(xs).lower().replace(" ", "_")),
}
reducers = Registry(_REDUCERS) # sig: Registry
"""Predefined reducers."""
_TRANSFORMERS = {
"int": int,
"float": float,
"bool": bool,
"len": len,
"lower": str.lower,
"upper": str.upper,
"capitalize": str.capitalize,
"lstrip": str.lstrip,
"rstrip": str.rstrip,
"strip": str.strip,
}
transformers = Registry(_TRANSFORMERS) # sig: Registry
"""Predefined transformers."""
def preprocess(root, pre):
"""Process a tree before starting extraction.
:sig: (Element, Sequence[Mapping[str, Any]]) -> None
:param root: Root of tree to process.
:param pre: Descriptions for processing operations.
"""
for step in pre:
op = step["op"]
if op == "remove":
remove_elements(root, step["path"])
elif op == "set_attr":
set_element_attr(root, step["path"], name=step["name"], value=step["value"])
elif op == "set_text":
set_element_text(root, step["path"], text=step["text"])
else:
raise ValueError("Unknown preprocessing operation")
def extract(element, items, section=None):
"""Extract data from an XML element.
:sig:
(
Element,
Sequence[Mapping[str, Any]],
Optional[str]
) -> Mapping[str, Any]
:param element: Element to extract the data from.
:param items: Descriptions for extracting items.
:param section: Path to select the root element for these items.
:return: Extracted data.
"""
rules = Rules([Rule.from_map(item) for item in items], section=section)
return rules.extract(element)
def scrape(document, spec, lxml_html=False):
"""Extract data from a document after optionally preprocessing it.
:sig: (str, Mapping[str, Any], Optional[bool]) -> Mapping[str, Any]
:param document: Document to scrape.
:param spec: Extraction specification.
:param lxml_html: Use the lxml.html builder if available.
:return: Extracted data.
"""
root = build_tree(document, lxml_html=lxml_html)
pre = spec.get("pre")
if pre is not None:
preprocess(root, pre)
data = extract(root, spec.get("items"), section=spec.get("section"))
return data
###########################################################
# COMMAND-LINE INTERFACE
###########################################################
def h2x(source):
"""Convert an HTML file into XHTML and print.
:sig: (str) -> None
:param source: Path of HTML file to convert.
"""
if source == "-":
_logger.debug("reading from stdin")
content = sys.stdin.read()
else:
_logger.debug("reading from file: %s", os.path.abspath(source))
with open(source, "rb") as f:
content = decode_html(f.read())
print(html_to_xhtml(content), end="")
def scrape_document(address, spec, content_format="xml"):
"""Scrape data from a file path or a URL and print.
:sig: (str, str, Optional[str]) -> None
:param address: File path or URL of document to scrape.
:param spec: Path of spec file.
:param content_format: Whether the content is XML or HTML.
"""
_logger.debug("loading spec from file: %s", os.path.abspath(spec))
if os.path.splitext(spec)[-1] in (".yaml", ".yml"):
if find_loader("yaml") is None:
raise RuntimeError("YAML support not available")
import yaml
spec_loader = yaml.load
else:
spec_loader = json.loads
with open(spec) as f:
spec_map = spec_loader(f.read())
if address.startswith(("http://", "https://")):
_logger.debug("loading url: %s", address)
with urlopen(address) as f:
content = f.read()
else:
_logger.debug("loading file: %s", os.path.abspath(address))
with open(address, "rb") as f:
content = f.read()
document = decode_html(content)
if content_format == "html":
_logger.debug("converting html document to xhtml")
document = html_to_xhtml(document)
# _logger.debug('=== CONTENT START ===\n%s\n=== CONTENT END===', document)
data = scrape(document, spec_map)
print(json.dumps(data, indent=2, sort_keys=True))
def make_parser(prog):
"""Build a parser for command line arguments.
:sig: (str) -> ArgumentParser
:param prog: Name of program.
:return: Parser for arguments.
"""
parser = ArgumentParser(prog=prog)
parser.add_argument("--version", action="version", version="%(prog)s " + __version__)
parser.add_argument("--debug", action="store_true", help="enable debug messages")
commands = parser.add_subparsers(metavar="command", dest="command")
commands.required = True
h2x_parser = commands.add_parser("h2x", help="convert HTML to XHTML")
h2x_parser.add_argument("file", help="file to convert")
h2x_parser.set_defaults(func=lambda a: h2x(a.file))
scrape_parser = commands.add_parser("scrape", help="scrape a document")
scrape_parser.add_argument("document", help="file path or URL of document to scrape")
scrape_parser.add_argument("-s", "--spec", required=True, help="spec file")
scrape_parser.add_argument("--html", action="store_true", help="document is in HTML format")
scrape_parser.set_defaults(
func=lambda a: scrape_document(
a.document, a.spec, content_format="html" if a.html else "xml"
)
)
return parser
def main(argv=None):
"""Entry point of the command line utility.
:sig: (Optional[List[str]]) -> None
:param argv: Command line arguments.
"""
argv = argv if argv is not None else sys.argv
parser = make_parser(prog="piculet")
arguments = parser.parse_args(argv[1:])
# set debug mode
if arguments.debug:
logging.basicConfig(level=logging.DEBUG)
_logger.debug("running in debug mode")
# run the handler for the selected command
try:
arguments.func(arguments)
except Exception as e:
print(e, file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()
tests/conftest.py

from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import fixture
import logging
import os
import sys
from hashlib import md5
from io import BytesIO
import piculet
PY2 = sys.version_info < (3, 0)
if PY2:
import mock
else:
from unittest import mock
if PY2:
from urllib2 import urlopen
else:
from urllib.request import urlopen
logging.raiseExceptions = False
cache_dir = os.path.join(os.path.dirname(__file__), ".cache")
if not os.path.exists(cache_dir):
os.makedirs(cache_dir)
def mock_urlopen(url):
key = md5(url.encode("utf-8")).hexdigest()
cache_file = os.path.join(cache_dir, key)
if not os.path.exists(cache_file):
content = urlopen(url).read()
with open(cache_file, "wb") as f:
f.write(content)
else:
with open(cache_file, "rb") as f:
content = f.read()
return BytesIO(content)
piculet.urlopen = mock.Mock(wraps=mock_urlopen)
@fixture(scope="session")
def shining_content():
"""Contents of the shining.html file."""
file_path = os.path.join(os.path.dirname(__file__), "..", "examples", "shining.html")
with open(file_path) as f:
content = f.read()
return content
@fixture
def shining(shining_content):
"""Root element of the XML tree for the movie document "The Shining"."""
return piculet.build_tree(shining_content)
# tests/test_cli.py
from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import config, mark, raises
import json
import logging
import os
import sys
from io import StringIO
from pkg_resources import get_distribution
import piculet
if sys.version_info.major < 3:
import mock
else:
from unittest import mock
base_dir = os.path.dirname(__file__)
wikipedia_spec = os.path.join(base_dir, "..", "examples", "wikipedia.json")
def test_version():
assert get_distribution("piculet").version == piculet.__version__
def test_help_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "--help"])
out, err = capsys.readouterr()
assert out.startswith("usage: ")
def test_version_should_print_version_number_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "--version"])
out, err = capsys.readouterr()
assert "piculet " + get_distribution("piculet").version + "\n" in {out, err}
def test_no_command_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet"])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert ("required: command" in err) or ("too few arguments" in err)
def test_invalid_command_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "foo"])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert ("invalid choice: 'foo'" in err) or ("invalid choice: u'foo'" in err)
def test_unrecognized_arguments_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "--foo", "h2x", ""])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert "unrecognized arguments: --foo" in err
def test_debug_mode_should_print_debug_messages(caplog):
caplog.set_level(logging.DEBUG)
with mock.patch("sys.stdin", StringIO("")):
piculet.main(argv=["piculet", "--debug", "h2x", "-"])
assert caplog.record_tuples[0][-1] == "running in debug mode"
def test_h2x_no_input_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "h2x"])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert ("required: file" in err) or ("too few arguments" in err)
@mark.skipif(sys.platform not in {"linux", "linux2"}, reason="/dev/shm only available on linux")
def test_h2x_should_read_given_file(capsys):
content = ""
with open("/dev/shm/test.html", "w") as f:
f.write(content)
piculet.main(argv=["piculet", "h2x", "/dev/shm/test.html"])
out, err = capsys.readouterr()
os.unlink("/dev/shm/test.html")
assert out == content
def test_h2x_should_read_stdin_when_input_is_dash(capsys):
content = ""
with mock.patch("sys.stdin", StringIO(content)):
piculet.main(argv=["piculet", "h2x", "-"])
out, err = capsys.readouterr()
assert out == content
def test_scrape_no_url_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "scrape", "-s", wikipedia_spec])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert ("required: document" in err) or ("too few arguments" in err)
def test_scrape_no_spec_should_print_usage_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "scrape", "http://www.foo.com/"])
out, err = capsys.readouterr()
assert err.startswith("usage: ")
assert ("required: -s" in err) or ("--spec is required" in err)
def test_scrape_missing_spec_file_should_fail_and_exit(capsys):
with raises(SystemExit):
piculet.main(argv=["piculet", "scrape", "http://www.foo.com/", "-s", "foo.json"])
out, err = capsys.readouterr()
assert "No such file or directory: " in err
def test_scrape_local_should_scrape_given_file(capsys):
dirname = os.path.join(os.path.dirname(__file__), "..", "examples")
shining = os.path.join(dirname, "shining.html")
spec = os.path.join(dirname, "movie.json")
piculet.main(argv=["piculet", "scrape", shining, "-s", spec])
out, err = capsys.readouterr()
data = json.loads(out)
assert data["title"] == "The Shining"
@mark.skipif(not config.getvalue("--cov"), reason="takes an unpredictable amount of time")
def test_scrape_should_scrape_given_url(capsys):
piculet.main(
argv=[
"piculet",
"scrape",
"https://en.wikipedia.org/wiki/David_Bowie",
"-s",
wikipedia_spec,
"--html",
]
)
out, err = capsys.readouterr()
data = json.loads(out)
assert data["name"] == "David Bowie"
# tests/test_extract.py
from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import raises
from piculet import Path, Rule, Rules, build_tree, reducers, transformers
def test_no_rules_should_return_empty_result(shining):
data = Rules([]).extract(shining)
assert data == {}
def test_extracted_value_should_be_reduced(shining):
rules = [Rule(key="title", extractor=Path("//title/text()", reduce=reducers.first))]
data = Rules(rules).extract(shining)
assert data == {"title": "The Shining"}
def test_default_reducer_should_be_concat(shining):
rules = [Rule(key="full_title", extractor=Path("//h1//text()"))]
data = Rules(rules).extract(shining)
assert data == {"full_title": "The Shining (1980)"}
def test_added_reducer_should_be_usable(shining):
reducers.register("second", lambda x: x[1])
rules = [Rule(key="year", extractor=Path("//h1//text()", reduce=reducers.second))]
data = Rules(rules).extract(shining)
assert data == {"year": "1980"}
def test_reduce_by_lambda_should_be_ok(shining):
rules = [Rule(key="title", extractor=Path("//title/text()", reduce=lambda xs: xs[0]))]
data = Rules(rules).extract(shining)
assert data == {"title": "The Shining"}
def test_reduced_value_should_be_transformable(shining):
rules = [Rule(key="year", extractor=Path('//span[@class="year"]/text()', transform=int))]
data = Rules(rules).extract(shining)
assert data == {"year": 1980}
def test_added_transformer_should_be_usable(shining):
transformers.register("year25", lambda x: int(x) + 25)
rules = [
Rule(
key="year",
extractor=Path('//span[@class="year"]/text()', transform=transformers.year25),
)
]
data = Rules(rules).extract(shining)
assert data == {"year": 2005}
def test_multiple_rules_should_generate_multiple_items(shining):
rules = [
Rule(key="title", extractor=Path("//title/text()")),
Rule("year", extractor=Path('//span[@class="year"]/text()', transform=int)),
]
data = Rules(rules).extract(shining)
assert data == {"title": "The Shining", "year": 1980}
def test_item_with_no_data_should_be_excluded(shining):
rules = [
Rule(key="title", extractor=Path("//title/text()")),
Rule(key="foo", extractor=Path("//foo/text()")),
]
data = Rules(rules).extract(shining)
assert data == {"title": "The Shining"}
def test_item_with_empty_str_value_should_be_included():
    content = '<root><foo val=""/></root>'  # reconstructed from the assertion below
rules = [Rule(key="foo", extractor=Path("//foo/@val"))]
data = Rules(rules).extract(build_tree(content))
assert data == {"foo": ""}
def test_item_with_zero_value_should_be_included():
    content = '<root><foo val="0"/></root>'  # reconstructed from the assertion below
rules = [Rule(key="foo", extractor=Path("//foo/@val", transform=int))]
data = Rules(rules).extract(build_tree(content))
assert data == {"foo": 0}
def test_item_with_false_value_should_be_included():
    content = '<root><foo val=""/></root>'  # reconstructed: bool("") is False
rules = [Rule(key="foo", extractor=Path("//foo/@val", transform=bool))]
data = Rules(rules).extract(build_tree(content))
assert data == {"foo": False}
def test_multivalued_item_should_be_list(shining):
rules = [
Rule(key="genres", extractor=Path(foreach='//ul[@class="genres"]/li', path="./text()"))
]
data = Rules(rules).extract(shining)
assert data == {"genres": ["Horror", "Drama"]}
def test_multivalued_items_should_be_transformable(shining):
rules = [
Rule(
key="genres",
extractor=Path(
foreach='//ul[@class="genres"]/li',
path="./text()",
transform=transformers.lower,
),
)
]
data = Rules(rules).extract(shining)
assert data == {"genres": ["horror", "drama"]}
def test_empty_values_should_be_excluded_from_multivalued_item_list(shining):
rules = [
Rule(key="foos", extractor=Path(foreach='//ul[@class="foos"]/li', path="./text()"))
]
data = Rules(rules).extract(shining)
assert data == {}
def test_subrules_should_generate_subitems(shining):
rules = [
Rule(
key="director",
extractor=Rules(
rules=[
Rule(key="name", extractor=Path('//div[@class="director"]//a/text()')),
Rule(key="link", extractor=Path('//div[@class="director"]//a/@href')),
]
),
)
]
data = Rules(rules).extract(shining)
assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}}
def test_multivalued_subrules_should_generate_list_of_subitems(shining):
rules = [
Rule(
key="cast",
extractor=Rules(
foreach='//table[@class="cast"]/tr',
rules=[
Rule(key="name", extractor=Path("./td[1]/a/text()")),
Rule(key="character", extractor=Path("./td[2]/text()")),
],
),
)
]
data = Rules(rules).extract(shining)
assert data == {
"cast": [
{"character": "Jack Torrance", "name": "Jack Nicholson"},
{"character": "Wendy Torrance", "name": "Shelley Duvall"},
]
}
def test_subitems_should_be_transformable(shining):
rules = [
Rule(
key="cast",
extractor=Rules(
foreach='//table[@class="cast"]/tr',
rules=[
Rule(key="name", extractor=Path("./td[1]/a/text()")),
Rule(key="character", extractor=Path("./td[2]/text()")),
],
transform=lambda x: "%(name)s as %(character)s" % x,
),
)
]
data = Rules(rules).extract(shining)
assert data == {
"cast": ["Jack Nicholson as Jack Torrance", "Shelley Duvall as Wendy Torrance"]
}
def test_key_should_be_generatable_using_path(shining):
rules = [
Rule(
foreach='//div[@class="info"]',
key=Path("./h3/text()"),
extractor=Path("./p/text()"),
)
]
data = Rules(rules).extract(shining)
assert data == {"Language:": "English", "Runtime:": "144 minutes"}
def test_generated_key_should_be_normalizable(shining):
rules = [
Rule(
foreach='//div[@class="info"]',
key=Path("./h3/text()", reduce=reducers.normalize),
extractor=Path("./p/text()"),
)
]
data = Rules(rules).extract(shining)
assert data == {"language": "English", "runtime": "144 minutes"}
def test_generated_key_should_be_transformable(shining):
rules = [
Rule(
foreach='//div[@class="info"]',
key=Path("./h3/text()", reduce=reducers.normalize, transform=lambda x: x.upper()),
extractor=Path("./p/text()"),
)
]
data = Rules(rules).extract(shining)
assert data == {"LANGUAGE": "English", "RUNTIME": "144 minutes"}
def test_generated_key_none_should_be_excluded(shining):
rules = [
Rule(
foreach='//div[@class="info"]',
key=Path("./foo/text()"),
extractor=Path("./p/text()"),
)
]
data = Rules(rules).extract(shining)
assert data == {}
def test_section_should_set_root_for_queries(shining):
rules = [
Rule(
key="director",
extractor=Rules(
section='//div[@class="director"]//a',
rules=[
Rule(key="name", extractor=Path("./text()")),
Rule(key="link", extractor=Path("./@href")),
],
),
)
]
data = Rules(rules).extract(shining)
assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}}
def test_section_no_roots_should_return_empty_result(shining):
rules = [
Rule(
key="director",
extractor=Rules(
section="//foo", rules=[Rule(key="name", extractor=Path("./text()"))]
),
)
]
data = Rules(rules).extract(shining)
assert data == {}
def test_section_multiple_roots_should_raise_error(shining):
with raises(ValueError):
rules = [
Rule(
key="director",
extractor=Rules(
section="//div", rules=[Rule(key="name", extractor=Path("./text()"))]
),
)
]
Rules(rules).extract(shining)
# tests/test_html.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import raises
from piculet import decode_html, html_to_xhtml
TEMPLATE = """
%(meta)s
"""
def test_html_to_xhtml_unicode_attribute_value_should_be_preserved():
content = """"""
normalized = html_to_xhtml(content)
assert normalized == """"""
# tests/test_preprocess.py
from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import raises
from piculet import extract, preprocess
def test_unknown_preprocessor_should_raise_error(shining):
with raises(ValueError):
pre = [{"op": "foo", "path": "//tr[1]"}]
preprocess(shining, pre)
def test_remove_should_remove_selected_element(shining):
pre = [{"op": "remove", "path": "//tr[1]"}]
items = [
{
"key": "cast",
"value": {
"foreach": '//table[@class="cast"]/tr',
"items": [{"key": "name", "value": {"path": "./td[1]/a/text()"}}],
},
}
]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"cast": [{"name": "Shelley Duvall"}]}
def test_remove_selected_none_should_not_cause_error(shining):
pre = [{"op": "remove", "path": "//tr[50]"}]
items = [
{
"key": "cast",
"value": {
"foreach": '//table[@class="cast"]/tr',
"items": [{"key": "name", "value": {"path": "./td[1]/a/text()"}}],
},
}
]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"cast": [{"name": "Jack Nicholson"}, {"name": "Shelley Duvall"}]}
def test_set_attr_value_from_str_should_set_attribute_for_selected_elements(shining):
pre = [
{"op": "set_attr", "path": "//ul[@class='genres']/li", "name": "foo", "value": "bar"}
]
items = [{"key": "genres", "value": {"foreach": "//li[@foo='bar']", "path": "./text()"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"genres": ["Horror", "Drama"]}
def test_set_attr_value_from_path_should_set_attribute_for_selected_elements(shining):
pre = [
{
"op": "set_attr",
"path": '//ul[@class="genres"]/li',
"name": "foo",
"value": {"path": "./text()"},
}
]
items = [{"key": "genres", "value": {"foreach": "//li[@foo]", "path": "./@foo"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"genres": ["Horror", "Drama"]}
def test_set_attr_value_from_path_no_value_should_be_ignored(shining):
pre = [
{
"op": "set_attr",
"path": '//ul[@class="genres"]/li',
"name": "foo",
"value": {"path": "./@bar"},
}
]
items = [{"key": "genres", "value": {"foreach": "//li[@foo]", "path": "./@foo"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {}
def test_set_attr_name_from_path_should_set_attribute_for_selected_elements(shining):
pre = [
{
"op": "set_attr",
"path": '//ul[@class="genres"]/li',
"name": {"path": "./text()"},
"value": "bar",
}
]
items = [{"key": "genres", "value": {"foreach": "//li[@Horror]", "path": "./@Horror"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"genres": ["bar"]}
def test_set_attr_name_from_path_no_value_should_be_ignored(shining):
pre = [
{
"op": "set_attr",
"path": '//ul[@class="genres"]/li',
"name": {"path": "./@bar"},
"value": "bar",
}
]
items = [{"key": "genres", "value": {"foreach": ".//li[@Horror]", "path": "./@Horror"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {}
def test_set_attr_selected_none_should_not_cause_error(shining):
pre = [{"op": "set_attr", "path": "//foo", "name": "foo", "value": "bar"}]
items = [{"key": "genres", "value": {"foreach": '//li[@foo="bar"]', "path": "./@foo"}}]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {}
def test_set_text_value_from_str_should_set_text_for_selected_elements(shining):
pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": "Foo"}]
items = [
{"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}}
]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"genres": ["Foo", "Foo"]}
def test_set_text_value_from_path_should_set_text_for_selected_elements(shining):
pre = [
{
"op": "set_text",
"path": '//ul[@class="genres"]/li',
"text": {"path": "./text()", "transform": "lower"},
}
]
items = [
{"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}}
]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {"genres": ["horror", "drama"]}
def test_set_text_no_value_should_be_ignored(shining):
pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": {"path": "./@foo"}}]
items = [
{"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}}
]
preprocess(shining, pre)
data = extract(shining, items)
assert data == {}
# tests/test_reducers.py
from __future__ import absolute_import, division, print_function, unicode_literals
from piculet import reducers
def test_reducer_first_should_return_first_item():
assert reducers.first(["a", "b", "c"]) == "a"
def test_reducer_concat_should_return_concatenated_items():
assert reducers.concat(["a", "b", "c"]) == "abc"
def test_reducer_clean_should_remove_extra_space():
assert reducers.clean([" a ", " b", " c "]) == "a b c"
def test_reducer_clean_should_treat_nbsp_as_space():
assert reducers.clean([" a ", " \xa0 b", " c "]) == "a b c"
def test_reducer_normalize_should_convert_to_lowercase():
assert reducers.normalize(["A", "B", "C"]) == "abc"
def test_reducer_normalize_should_remove_nonalphanumeric_characters():
assert reducers.normalize(["a+", "?b7", "{c}"]) == "ab7c"
def test_reducer_normalize_should_keep_underscores():
assert reducers.normalize(["a_", "b", "c"]) == "a_bc"
def test_reducer_normalize_should_replace_spaces_with_underscores():
assert reducers.normalize(["a", " b", "c"]) == "a_bc"
# tests/test_scrape.py
from __future__ import absolute_import, division, print_function, unicode_literals
from pytest import raises
from piculet import reducers, scrape, transformers
def test_no_rules_should_return_empty_result(shining_content):
data = scrape(shining_content, {"items": []})
assert data == {}
def test_extracted_value_should_be_reduced(shining_content):
items = [{"key": "title", "value": {"path": "//title/text()", "reduce": "first"}}]
data = scrape(shining_content, {"items": items})
assert data == {"title": "The Shining"}
def test_default_reducer_should_be_concat(shining_content):
items = [{"key": "full_title", "value": {"path": "//h1//text()"}}]
data = scrape(shining_content, {"items": items})
assert data == {"full_title": "The Shining (1980)"}
def test_added_reducer_should_be_usable(shining_content):
reducers.register("second", lambda x: x[1])
items = [{"key": "year", "value": {"path": "//h1//text()", "reduce": "second"}}]
data = scrape(shining_content, {"items": items})
assert data == {"year": "1980"}
def test_unknown_reducer_should_raise_error(shining_content):
with raises(ValueError):
items = [{"key": "year", "value": {"path": "//h1//text()", "reduce": "foo"}}]
scrape(shining_content, {"items": items})
def test_reduced_value_should_be_transformable(shining_content):
items = [
{"key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "int"}}
]
data = scrape(shining_content, {"items": items})
assert data == {"year": 1980}
def test_added_transformer_should_be_usable(shining_content):
transformers.register("year25", lambda x: int(x) + 25)
items = [
{
"key": "year",
"value": {"path": '//span[@class="year"]/text()', "transform": "year25"},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"year": 2005}
def test_unknown_transformer_should_raise_error(shining_content):
with raises(ValueError):
items = [
{
"key": "year",
"value": {"path": '//span[@class="year"]/text()', "transform": "year42"},
}
]
scrape(shining_content, {"items": items})
def test_multiple_rules_should_generate_multiple_items(shining_content):
items = [
{"key": "title", "value": {"path": "//title/text()"}},
{"key": "year", "value": {"path": '//span[@class="year"]/text()', "transform": "int"}},
]
data = scrape(shining_content, {"items": items})
assert data == {"title": "The Shining", "year": 1980}
def test_item_with_no_data_should_be_excluded(shining_content):
items = [
{"key": "title", "value": {"path": "//title/text()"}},
{"key": "foo", "value": {"path": "//foo/text()"}},
]
data = scrape(shining_content, {"items": items})
assert data == {"title": "The Shining"}
def test_multivalued_item_should_be_list(shining_content):
items = [
{"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}}
]
data = scrape(shining_content, {"items": items})
assert data == {"genres": ["Horror", "Drama"]}
def test_multivalued_items_should_be_transformable(shining_content):
items = [
{
"key": "genres",
"value": {
"foreach": '//ul[@class="genres"]/li',
"path": "./text()",
"transform": "lower",
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"genres": ["horror", "drama"]}
def test_empty_values_should_be_excluded_from_multivalued_item_list(shining_content):
items = [
{"key": "foos", "value": {"foreach": '//ul[@class="foos"]/li', "path": "./text()"}}
]
data = scrape(shining_content, {"items": items})
assert data == {}
def test_subrules_should_generate_subitems(shining_content):
items = [
{
"key": "director",
"value": {
"items": [
{"key": "name", "value": {"path": '//div[@class="director"]//a/text()'}},
{"key": "link", "value": {"path": '//div[@class="director"]//a/@href'}},
]
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}}
def test_multivalued_subrules_should_generate_list_of_subitems(shining_content):
items = [
{
"key": "cast",
"value": {
"foreach": '//table[@class="cast"]/tr',
"items": [
{"key": "name", "value": {"path": "./td[1]/a/text()"}},
{"key": "character", "value": {"path": "./td[2]/text()"}},
],
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {
"cast": [
{"character": "Jack Torrance", "name": "Jack Nicholson"},
{"character": "Wendy Torrance", "name": "Shelley Duvall"},
]
}
def test_subitems_should_be_transformable(shining_content):
transformers.register("stars", lambda x: "%(name)s as %(character)s" % x)
items = [
{
"key": "cast",
"value": {
"foreach": '//table[@class="cast"]/tr',
"items": [
{"key": "name", "value": {"path": "./td[1]/a/text()"}},
{"key": "character", "value": {"path": "./td[2]/text()"}},
],
"transform": "stars",
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {
"cast": ["Jack Nicholson as Jack Torrance", "Shelley Duvall as Wendy Torrance"]
}
def test_key_should_be_generatable_using_path(shining_content):
items = [
{
"foreach": '//div[@class="info"]',
"key": {"path": "./h3/text()"},
"value": {"path": "./p/text()"},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"Language:": "English", "Runtime:": "144 minutes"}
def test_generated_key_should_be_normalizable(shining_content):
items = [
{
"foreach": '//div[@class="info"]',
"key": {"path": "./h3/text()", "reduce": "normalize"},
"value": {"path": "./p/text()"},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"language": "English", "runtime": "144 minutes"}
def test_generated_key_should_be_transformable(shining_content):
items = [
{
"foreach": '//div[@class="info"]',
"key": {"path": "./h3/text()", "reduce": "normalize", "transform": "upper"},
"value": {"path": "./p/text()"},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"LANGUAGE": "English", "RUNTIME": "144 minutes"}
def test_generated_key_none_should_be_excluded(shining_content):
items = [
{
"foreach": '//div[@class="info"]',
"key": {"path": "./foo/text()"},
"value": {"path": "./p/text()"},
}
]
data = scrape(shining_content, {"items": items})
assert data == {}
def test_tree_should_be_preprocessable(shining_content):
pre = [{"op": "set_text", "path": '//ul[@class="genres"]/li', "text": "Foo"}]
items = [
{"key": "genres", "value": {"foreach": '//ul[@class="genres"]/li', "path": "./text()"}}
]
data = scrape(shining_content, {"items": items, "pre": pre})
assert data == {"genres": ["Foo", "Foo"]}
def test_section_should_set_root_for_queries(shining_content):
items = [
{
"key": "director",
"value": {
"section": '//div[@class="director"]//a',
"items": [
{"key": "name", "value": {"path": "./text()"}},
{"key": "link", "value": {"path": "./@href"}},
],
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {"director": {"link": "/people/1", "name": "Stanley Kubrick"}}
def test_section_no_roots_should_return_empty_result(shining_content):
items = [
{
"key": "director",
"value": {
"section": "//foo",
"items": [{"key": "name", "value": {"path": "./text()"}}],
},
}
]
data = scrape(shining_content, {"items": items})
assert data == {}
def test_section_multiple_roots_should_raise_error(shining_content):
with raises(ValueError):
items = [
{
"key": "director",
"value": {
"section": "//div",
"items": [{"key": "name", "value": {"path": "./text()"}}],
},
}
]
scrape(shining_content, {"items": items})
# tests/test_xpath.py
from __future__ import absolute_import, division, print_function, unicode_literals
from piculet import build_tree, xpath
content = '<root><t1 a="v">foo</t1><t1><t2>bar</t2></t1></root>'  # reconstructed from the assertions below
root = build_tree(content)
def test_non_text_queries_should_return_elements():
selected = xpath(root, ".//t1")
assert [s.tag for s in selected] == ["t1", "t1"]
def test_child_text_queries_should_return_strings():
selected = xpath(root, ".//t1/text()")
assert selected == ["foo"]
def test_descendant_text_queries_should_return_strings():
selected = xpath(root, ".//t1//text()")
assert selected == ["foo", "bar"]
def test_attr_queries_should_return_strings():
selected = xpath(root, ".//t1/@a")
assert selected == ["v"]
def test_non_absolute_queries_should_be_ok():
selected = xpath(root, "//t1")
assert [s.tag for s in selected] == ["t1", "t1"]
# piculet-1.0.1.dist-info/LICENSE.txt
GNU LESSER GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc.
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
This version of the GNU Lesser General Public License incorporates
the terms and conditions of version 3 of the GNU General Public
License, supplemented by the additional permissions listed below.
0. Additional Definitions.
As used herein, "this License" refers to version 3 of the GNU Lesser
General Public License, and the "GNU GPL" refers to version 3 of the GNU
General Public License.
"The Library" refers to a covered work governed by this License,
other than an Application or a Combined Work as defined below.
An "Application" is any work that makes use of an interface provided
by the Library, but which is not otherwise based on the Library.
Defining a subclass of a class defined by the Library is deemed a mode
of using an interface provided by the Library.
A "Combined Work" is a work produced by combining or linking an
Application with the Library. The particular version of the Library
with which the Combined Work was made is also called the "Linked
Version".
The "Minimal Corresponding Source" for a Combined Work means the
Corresponding Source for the Combined Work, excluding any source code
for portions of the Combined Work that, considered in isolation, are
based on the Application, and not on the Linked Version.
The "Corresponding Application Code" for a Combined Work means the
object code and/or source code for the Application, including any data
and utility programs needed for reproducing the Combined Work from the
Application, but excluding the System Libraries of the Combined Work.
1. Exception to Section 3 of the GNU GPL.
You may convey a covered work under sections 3 and 4 of this License
without being bound by section 3 of the GNU GPL.
2. Conveying Modified Versions.
If you modify a copy of the Library, and, in your modifications, a
facility refers to a function or data to be supplied by an Application
that uses the facility (other than as an argument passed when the
facility is invoked), then you may convey a copy of the modified
version:
a) under this License, provided that you make a good faith effort to
ensure that, in the event an Application does not supply the
function or data, the facility still operates, and performs
whatever part of its purpose remains meaningful, or
b) under the GNU GPL, with none of the additional permissions of
this License applicable to that copy.
3. Object Code Incorporating Material from Library Header Files.
The object code form of an Application may incorporate material from
a header file that is part of the Library. You may convey such object
code under terms of your choice, provided that, if the incorporated
material is not limited to numerical parameters, data structure
layouts and accessors, or small macros, inline functions and templates
(ten or fewer lines in length), you do both of the following:
a) Give prominent notice with each copy of the object code that the
Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the object code with a copy of the GNU GPL and this license
document.
4. Combined Works.
You may convey a Combined Work under terms of your choice that,
taken together, effectively do not restrict modification of the
portions of the Library contained in the Combined Work and reverse
engineering for debugging such modifications, if you also do each of
the following:
a) Give prominent notice with each copy of the Combined Work that
the Library is used in it and that the Library and its use are
covered by this License.
b) Accompany the Combined Work with a copy of the GNU GPL and this license
document.
c) For a Combined Work that displays copyright notices during
execution, include the copyright notice for the Library among
these notices, as well as a reference directing the user to the
copies of the GNU GPL and this license document.
d) Do one of the following:
0) Convey the Minimal Corresponding Source under the terms of this
License, and the Corresponding Application Code in a form
suitable for, and under terms that permit, the user to
recombine or relink the Application with a modified version of
the Linked Version to produce a modified Combined Work, in the
manner specified by section 6 of the GNU GPL for conveying
Corresponding Source.
1) Use a suitable shared library mechanism for linking with the
Library. A suitable mechanism is one that (a) uses at run time
a copy of the Library already present on the user's computer
system, and (b) will operate properly with a modified version
of the Library that is interface-compatible with the Linked
Version.
e) Provide Installation Information, but only if you would otherwise
be required to provide such information under section 6 of the
GNU GPL, and only to the extent that such information is
necessary to install and execute a modified version of the
Combined Work produced by recombining or relinking the
Application with a modified version of the Linked Version. (If
you use option 4d0, the Installation Information must accompany
the Minimal Corresponding Source and Corresponding Application
Code. If you use option 4d1, you must provide the Installation
Information in the manner specified by section 6 of the GNU GPL
for conveying Corresponding Source.)
5. Combined Libraries.
You may place library facilities that are a work based on the
Library side by side in a single library together with other library
facilities that are not Applications and are not covered by this
License, and convey such a combined library under terms of your
choice, if you do both of the following:
a) Accompany the combined library with a copy of the same work based
on the Library, uncombined with any other library facilities,
conveyed under the terms of this License.
b) Give prominent notice with the combined library that part of it
is a work based on the Library, and explaining where to find the
accompanying uncombined form of the same work.
6. Revised Versions of the GNU Lesser General Public License.
The Free Software Foundation may publish revised and/or new versions
of the GNU Lesser General Public License from time to time. Such new
versions will be similar in spirit to the present version, but may
differ in detail to address new problems or concerns.
Each version is given a distinguishing version number. If the
Library as you received it specifies that a certain numbered version
of the GNU Lesser General Public License "or any later version"
applies to it, you have the option of following the terms and
conditions either of that published version or of any later version
published by the Free Software Foundation. If the Library as you
received it does not specify a version number of the GNU Lesser
General Public License, you may choose any version of the GNU Lesser
General Public License ever published by the Free Software Foundation.
If the Library as you received it specifies that a proxy can decide
whether future versions of the GNU Lesser General Public License shall
apply, that proxy's public statement of acceptance of any version is
permanent authorization for you to choose that version for the
Library.