We need to crawl a number of message boards and perform sentiment analysis and topic/idea extraction on the threads and replies. Specifically, we need to crawl 300 to 400 message boards and look for posts related to approximately 1000 different companies.
Since we are new to this type of data collection, we are not sure of the best way to proceed. Should we crawl the message boards, scrape all posts made since the last crawl, and then run queries against the results for each company? Or should we do separate crawls of the message boards, using each company as a keyword?
Also, is there a way to set up a crawler that can automatically crawl all of the boards for posts made since the last crawl, or would we need to customize a crawler for each board?
Incidentally, we will not inappropriately scrape or use any board's data. We will display only a small portion of the content (one sentence, if any at all) and link each entry to its original on the board it came from. We will also link any extracted ideas to the original posts/threads.
Thank you for any suggestions.
EMPLOYEE0: There are a variety of ways you can go about this, but I would recommend the following:
1. Group crawls from multiple (10-100) message boards into a single crawl.
2. Build a custom 80app (http://wiki.80legs.com/80apps) that will follow the appropriate links to posts on each message board. Within that 80app, also use a list of keywords that represent the companies you want to track. Whenever you find one of those keywords, return 100 or more characters surrounding it to get a slice of the post relevant to your needs.
3. Have the 80app also track various date formats on pages crawled (since presumably message boards may use different date representations). If you see a date that is too old, have the 80app stop crawling at that page.
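As a rough illustration of steps 2 and 3, here is a minimal Python sketch of the per-page logic. It is not the 80app API: the company names, the date formats, and the function names extract_snippets, find_dates, and should_stop are all assumptions made up for this example. It also assumes one reasonable reading of "too old": stop only when every recognizable date on the page is older than the cutoff, so that an old thread with a fresh reply is still followed.

```python
import re
from datetime import datetime

# Hypothetical company keywords; a real list would hold about 1000 names.
COMPANIES = ["Acme Corp", "Globex"]

# One case-insensitive pattern for all keywords, longest names first so a
# longer name wins over a shorter one that shares a prefix.
KEYWORD_RE = re.compile(
    r"\b("
    + "|".join(re.escape(c) for c in sorted(COMPANIES, key=len, reverse=True))
    + r")\b",
    re.IGNORECASE,
)

# Assumed date formats; each board may need its own entries here.
DATE_PATTERNS = [
    (re.compile(r"\b\d{4}-\d{2}-\d{2}\b"), "%Y-%m-%d"),
    (re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"), "%m/%d/%Y"),
    (re.compile(r"\b[A-Z][a-z]{2} \d{1,2}, \d{4}\b"), "%b %d, %Y"),
]


def extract_snippets(text, context=100):
    """Return (keyword, snippet) pairs with `context` characters on each side."""
    results = []
    for match in KEYWORD_RE.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        results.append((match.group(1), text[start:end]))
    return results


def find_dates(text):
    """Parse every substring that matches one of the assumed date formats."""
    dates = []
    for pattern, fmt in DATE_PATTERNS:
        for raw in pattern.findall(text):
            try:
                dates.append(datetime.strptime(raw, fmt))
            except ValueError:
                # Looked like a date but was not one (for example month 13).
                continue
    return dates


def should_stop(text, cutoff):
    """True if the page has recognizable dates and all are older than cutoff."""
    dates = find_dates(text)
    return bool(dates) and max(dates) < cutoff
```

A crawl callback could call should_stop(page_text, last_crawl_time) before following further links from a page, and extract_snippets(page_text) to collect the slices to return. Pages with no recognizable date are not treated as a stop signal here, which errs on the side of crawling too much rather than too little.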
Thanks for the response. One thing I am still not sure of is how to crawl all posts on the selected boards that were made over a given period of time (such as the last 24 hours). I guess it would be possible to create an app that submits each company as a search term and then crawls the results.
Maybe the best idea is to create an account and play with the crawler a little to see what is possible.
I came here to share my knowledge and skills with you.
We collect data from online communities such as discussion boards, blogs, and social networks.
We use deep linguistic parsing, statistical natural language processing, and machine learning to analyse your content, extracting semantic metadata: information about people, places, companies, topics, languages, and more.
A short platform overview:
A single unified platform for all content types (consolidate to reduce development and maintenance costs)
Flexible system which can support any new content type
High automation (cut configuration costs)
Real-time coverage, or as close to it as possible, for each content type
Some of our challenges
Supporting many different types of content
Automatically “understanding” millions of sites with different structures
Over 6000 message boards
Over 90 million blogs
Handling large quantities of data
Over 2 million new messages per day
Over 1 million new blog posts per day
Supporting data in different languages
A sample is attached for your perusal. Please review it and give me your valuable feedback.
We would be pleased to share more samples if you need them.