Diff() Between Two Collections in MongoDB
Solution 1:
A couple of suggestions:
You could use a combination of the url and the date accessed (or at least part of the datetime object) as the _id for these objects, since from what I can tell you plan to scrape each url once a month.
Example:
{
    "_id": {
        "url": "www.google.com",
        "date": ISODate("2013-03-01")
    },
    // Other attributes
}
This pays off in performance, uniqueness, and querying (see this 4sq blog post). You could query with something like:
db.collection.find({
    "_id": {
        "$gte": {
            "url": yourUrl,
            "date": rangeStart
        },
        "$lt": {
            "url": yourUrl,
            "date": rangeEnd
        }
    }
})
This returns results sorted by url and THEN by date, which seems to be just what you want. You could also use the _id index to perform covered queries (over the _id field alone) if you just want a list of all the urls and months you have scraped; this could set you up nicely to go through each url one at a time.
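As a rough sketch of how the range bounds for the query above might be built, here is a small helper that produces the filter for one url and one calendar month. The name monthRangeFilter and its parameters are made up for illustration, and it assumes the "date" part of the _id is stored as a UTC date. The key order (url first, then date) is kept the same as in the stored _id, since embedded documents are compared field by field in order.

```javascript
// Hypothetical helper (not part of MongoDB): builds the _id range filter
// for a single url and a single calendar month, using UTC month boundaries.
function monthRangeFilter(url, year, month) {
  // month is 1-12; Date.UTC takes a 0-based month index and rolls over
  // past December into January of the following year.
  const rangeStart = new Date(Date.UTC(year, month - 1, 1));
  const rangeEnd = new Date(Date.UTC(year, month, 1));
  return {
    _id: {
      // Keep the same key order as the stored _id: url, then date.
      $gte: { url: url, date: rangeStart },
      $lt: { url: url, date: rangeEnd },
    },
  };
}

// Example: everything scraped from www.google.com during March 2013.
const filter = monthRangeFilter("www.google.com", 2013, 3);
```

The resulting object can be passed straight to db.collection.find(filter) in the shell. Using a half-open range ($gte on the first of the month, $lt on the first of the next) avoids having to work out how many days the month has.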
If there are specific attributes of the document that you want to compare (headers.server, for example) and a specific comparison you want to make (looking for any increment in version numbers, say), I would use a regex to grab the parts relevant to the version number (a quick and dirty one might simply retrieve all numeric elements) and graph them for each url; I assume this would let you visualize changes to server software over time. You could just as easily report whenever any of these attributes changes by scanning the documents in date order and firing some event when consecutive strings are not identical, perhaps reporting the change or just its numeric part.
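A minimal sketch of that scan, assuming you have already pulled one attribute (such as headers.server) out of each document for a url and sorted the results by date. The function names, the { date, value } shape, and the sample data are all made up for illustration; the regex is the quick and dirty "all numeric elements" approach, not a real version parser.

```javascript
// Pull every run of digits out of a header value,
// e.g. "Apache/2.2.22 (Ubuntu)" gives [2, 2, 22].
function extractNumbers(value) {
  const matches = String(value).match(/\d+/g);
  return matches ? matches.map(Number) : [];
}

// Given snapshots already sorted by date, report each point where the
// attribute string differs from the previous snapshot.
function findChanges(snapshots) {
  const changes = [];
  for (let i = 1; i < snapshots.length; i++) {
    const prev = snapshots[i - 1];
    const curr = snapshots[i];
    if (prev.value !== curr.value) {
      changes.push({
        date: curr.date,
        from: prev.value,
        to: curr.value,
        fromNumbers: extractNumbers(prev.value),
        toNumbers: extractNumbers(curr.value),
      });
    }
  }
  return changes;
}

// Made-up sample history for a single url.
const history = [
  { date: "2013-01-01", value: "Apache/2.2.22 (Ubuntu)" },
  { date: "2013-02-01", value: "Apache/2.2.22 (Ubuntu)" },
  { date: "2013-03-01", value: "Apache/2.4.4 (Ubuntu)" },
];
const changes = findChanges(history);
```

Here changes holds a single entry for 2013-03-01, with the numeric parts going from [2, 2, 22] to [2, 4, 4]. The numeric arrays are what you would feed into a graph, while the from/to strings are enough for a plain change report.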