Baleen Components
Baleen's objective is simple: given an OPML file of RSS feeds, download all the posts from those feeds and save them to MongoDB storage. While this task seems like it could be easily completed with a single function, once you start integrating the parts of the program, things get more complex. The following component architecture describes how we've put together Baleen:
There are three main parts to the component architecture:
- Interacting with the local disk: importing OPML and exporting a corpus.
- Interacting with the MongoDB storage of posts.
- Fetching data from both the RSS feeds and the complete web pages.
Additionally there are utilities, configuration, and logging as well as the command line program that uses commis, but those are pretty standard and are not specific to Baleen. In the next sections we'll look at and describe the operation of each of these main blocks of code.
MongoDB Models
The central part of the operation of Baleen revolves around the interaction with MongoDB. Baleen uses mongoengine as an ODM to provide models for inserting documents into collections. There are two primary models:
- Feed: maintains information about an RSS or Atom feed.
- Post: a document that has been syndicated by a feed.
Hopefully the relationship is clear: a Feed is a listing of Post documents. Our collection objective is the HTML content of a Post, and we use the Feed to obtain the Post rather than web scraping.
Note that these models do nothing except manipulate their data store and read and write to the database. Methods for ingestion, wrangling, or fetching the full web page wrap their respective models. For example, you wouldn't call Feed.sync() to collect the latest RSS feed; instead you would use some Sync object and pass it a feed: Sync(feed).
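The separation between models and the objects that wrap them can be sketched as follows. This is a minimal illustration, not Baleen's actual code: plain dataclasses stand in for the mongoengine documents, and the Sync constructor signature, the fake_fetch helper, and all field names are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Post:
    """Stand-in for the Post model: holds data only."""
    title: str
    url: str
    content: str = ""


@dataclass
class Feed:
    """Stand-in for the Feed model: holds data only, no sync logic."""
    title: str
    link: str
    posts: List[Post] = field(default_factory=list)


class Sync:
    """Hypothetical wrapper: the behaviour lives here, not on the model."""

    def __init__(self, feed: Feed, fetch: Callable[[str], List[dict]]):
        self.feed = feed
        self.fetch = fetch  # injected so the sketch needs no network access

    def sync(self) -> List[Post]:
        # Turn each fetched entry into a Post and attach it to the feed.
        entries = self.fetch(self.feed.link)
        new_posts = [Post(title=e["title"], url=e["link"]) for e in entries]
        self.feed.posts.extend(new_posts)
        return new_posts


def fake_fetch(link: str) -> List[dict]:
    """Pretend to download and parse a feed (illustration only)."""
    return [{"title": "Hello", "link": link + "/hello"}]


feed = Feed(title="Example", link="http://example.com/rss")
posts = Sync(feed, fake_fetch).sync()
print(posts[0].url)
```

Because the model carries no fetching logic, the wrapper can be tested with a fake fetch function and no network or database.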
Ingestion
The ingestion portion of the Baleen service is the most critical and the requirements are as follows:
- On a routine basis, collect and ingest feeds from MongoDB or an OPML file.
- Synchronize feeds by fetching the latest RSS/Atom from their xmlUrl.
- For each item in the synchronized feed, create and wrangle a post.
- For each post fetch the full HTML from the htmlUrl.
- Be able to track the start/stop/duration of the ingestion for a set of feeds.
- Be able to track the number of errors and the number of posts ingested.
- Allow no duplicate posts to be added to the database.
In order to synchronize feeds, we use the feedparser library, and to fetch documents from the web, we use Requests. A single Ingest instance takes as input an iterable of feeds from either MongoDB or from an OPML file. When run, it maintains two queues: a feed processing queue and a page processing queue (so that it can be threaded or multiprocessed).
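The two-queue arrangement described above might look roughly like the sketch below. It is an assumption-laden illustration rather than Baleen's implementation: the ingest function and its sync_feed and fetch_page parameters are hypothetical names, and both queues are drained in a single thread for simplicity.

```python
import queue


def ingest(feeds, sync_feed, fetch_page):
    """Drain a feed queue into a page queue, then drain the page queue.

    sync_feed(feed) is assumed to return an iterable of posts;
    fetch_page(post) is assumed to return the fetched page.
    """
    feed_queue = queue.Queue()
    page_queue = queue.Queue()

    for feed in feeds:
        feed_queue.put(feed)

    # Stage 1: each synchronized feed produces posts for the page queue.
    while not feed_queue.empty():
        feed = feed_queue.get()
        for post in sync_feed(feed):
            page_queue.put(post)

    # Stage 2: each queued post has its full page fetched.
    pages = []
    while not page_queue.empty():
        post = page_queue.get()
        pages.append(fetch_page(post))
    return pages


# Fake stages so the sketch runs without network access.
pages = ingest(
    ["a", "b"],
    sync_feed=lambda feed: [feed + "1", feed + "2"],
    fetch_page=lambda post: post.upper(),
)
print(pages)
```

Since queue.Queue is thread-safe, the two while loops could be handed to worker threads without changing how the stages communicate.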
Feed processing is performed by a FeedSync object which takes a single feed as input. The FeedSync object fetches the RSS via feedparser and iterates through all posts, wrangling them and saving them to Mongo. The PageWrangler object takes a post as input, wrangles the data from a variety of feed types, then fetches the complete web page.
Once the Ingest instance has cleared its work queue, it logs various information and terminates. Note that the Ingest instance is responsible for error handling and logging, while the sync and fetch utilities must raise exceptions.
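As a rough sketch of that division of responsibility, the worker below only raises, while the ingest loop catches, counts, and logs. The names run_ingest, SyncError, and flaky_sync are hypothetical, and the statistics dictionary is just one way to track the duration, error count, and post count listed in the requirements.

```python
import logging
import time

logger = logging.getLogger("ingest_sketch")


class SyncError(Exception):
    """Hypothetical error raised by a sync or fetch utility."""


def run_ingest(feeds, sync_feed):
    """Sync every feed, recording counts and duration.

    sync_feed(feed) is assumed to return the number of posts it saved
    and to raise an exception on failure.
    """
    stats = {"feeds": 0, "posts": 0, "errors": 0}
    started = time.monotonic()
    for feed in feeds:
        try:
            stats["posts"] += sync_feed(feed)
            stats["feeds"] += 1
        except Exception as exc:
            # The ingest loop, not the worker, decides how to handle errors.
            stats["errors"] += 1
            logger.error("could not sync %s: %s", feed, exc)
    stats["duration"] = time.monotonic() - started
    logger.info("ingestion finished: %s", stats)
    return stats


def flaky_sync(feed):
    """Fake worker: fails for one feed, saves two posts for the others."""
    if feed == "bad":
        raise SyncError("feed unreachable")
    return 2


stats = run_ingest(["a", "bad", "b"], flaky_sync)
print(stats["feeds"], stats["posts"], stats["errors"])
```

One failing feed does not stop the run; it is recorded and the loop moves on to the next feed.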
Import and Export
The import utility uses an OPMLReader to load and parse the OPML file from disk with Beautiful Soup. The OPML file exposes a tree hierarchy or table of contents structure to the feeds where the first level is a "category" and the secondary level is each RSS/Atom feed item. On import, we simply read the OPML file and add any additional feeds to the MongoDB without duplication. This allows us to maintain a single master list of RSS from multiple OPML files.
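To make the two-level structure concrete, here is a small sketch that walks categories and feeds and skips feeds that are already known. It deliberately uses the standard library's xml.etree.ElementTree instead of Beautiful Soup so that it runs without extra dependencies. The sample OPML, the iter_feeds and import_feeds helpers, and the dictionary standing in for MongoDB are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A tiny, made-up OPML document: one category containing one feed.
SAMPLE_OPML = """<opml version="1.0">
  <head><title>Example subscriptions</title></head>
  <body>
    <outline text="news" title="news">
      <outline type="rss" text="Example Feed" title="Example Feed"
               xmlUrl="http://example.com/rss" htmlUrl="http://example.com"/>
    </outline>
  </body>
</opml>"""


def iter_feeds(opml_text):
    """Yield one dict per feed, tagged with its first-level category."""
    root = ET.fromstring(opml_text)
    for category in root.findall("./body/outline"):
        for outline in category.findall("./outline"):
            yield {
                "category": category.get("title") or category.get("text"),
                "title": outline.get("title") or outline.get("text"),
                "xmlUrl": outline.get("xmlUrl"),
                "htmlUrl": outline.get("htmlUrl"),
            }


def import_feeds(opml_text, known):
    """Add feeds not yet in `known` (keyed by xmlUrl); return how many were added."""
    added = 0
    for feed in iter_feeds(opml_text):
        if feed["xmlUrl"] not in known:
            known[feed["xmlUrl"]] = feed
            added += 1
    return added


known_feeds = {}  # stands in for the feeds collection in MongoDB
first = import_feeds(SAMPLE_OPML, known_feeds)
second = import_feeds(SAMPLE_OPML, known_feeds)
print(first, second)
```

Keying on xmlUrl is one plausible way to avoid duplicates when the same feed appears in several OPML files; the real uniqueness rule may differ.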
Note, we've found that the best way to create OPML files is to use the Feedly app, which allows us to organize our feeds. Under their "organize feeds" section, they also have an Export OPML link (and an import OPML link).
The export utility creates a categorized corpus structure ready for NLTK using the MongoExport class. Each category from the TOC structure of the OPML is a directory in the corpus on disk, and each post is written as an HTML file within its category directory. The exporter also writes a README file with information about the contents of the corpus. Key concerns here involve HTML sanitization (removing scripts) and readability (extracting only the text we want to analyze).
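The on-disk layout described above could be produced by something like the following sketch. It covers only the directory structure and the README, not sanitization or readability extraction. The export_corpus function, the file naming scheme, and the post dictionaries are hypothetical and are not taken from MongoExport.

```python
import tempfile
from pathlib import Path


def export_corpus(posts, root):
    """Write each post under root/<category>/ and a README summarising counts.

    Each post is assumed to be a dict with "category" and "content" keys.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    counts = {}
    for index, post in enumerate(posts):
        category_dir = root / post["category"]
        category_dir.mkdir(parents=True, exist_ok=True)
        # Zero-padded index as a file name; the real naming scheme may differ.
        path = category_dir / f"{index:04d}.html"
        path.write_text(post["content"], encoding="utf-8")
        counts[post["category"]] = counts.get(post["category"], 0) + 1

    lines = ["Corpus export sketch", ""]
    lines += [f"{category}: {n} posts" for category, n in sorted(counts.items())]
    (root / "README").write_text("\n".join(lines) + "\n", encoding="utf-8")
    return counts


sample_posts = [
    {"category": "news", "content": "<p>First</p>"},
    {"category": "news", "content": "<p>Second</p>"},
    {"category": "tech", "content": "<p>Third</p>"},
]

with tempfile.TemporaryDirectory() as tmp:
    counts = export_corpus(sample_posts, tmp)
    exported = sorted(p.relative_to(tmp).as_posix() for p in Path(tmp).rglob("*.html"))
    readme = (Path(tmp) / "README").read_text(encoding="utf-8")

print(counts)
print(exported)
```

Keeping one directory per category means the category of a document can be recovered from its path alone.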