Comments on Salmon Run: Crawling web pages with WebSPHINX (Sujit Pal)

Sujit Pal (2014-04-18):
I guess you could if you had full links inside your script sections, but since Javascript is a programming language and URLs can be represented as something other than plain string literals, it's not guaranteed.

Anonymous (2014-04-18):
Hi, can I also download Javascript links using this crawler? Thanks in advance.
Sujit Pal (2013-02-20):
Ah, got it, thanks! :-)

Francesco (2013-02-20):
I'm referring to your class MyCrawler. If you write

    dp.changeObeyRobotExclusion(true);

nothing happens. I want to note that the correct syntax (in order to make something happen :)) is

    dp = dp.changeObeyRobotExclusion(true);

since otherwise you are trying to modify an immutable object, like a java.lang.String. Cheers again, Francesco.

Sujit Pal (2013-02-19):
Thanks for your kind words, Francesco, and thank you for the information. I did not know about DownloadParameters, but it looks like a simple and useful way of filtering web pages.

francesco (2013-02-19):
A tip: if you use the DownloadParameters.changeXXX() methods directly there will be no effect. DownloadParameters objects are like java.lang.Strings in this implementation, so you have to use them like this:

    dp = dp.changeXXX();

Anyway, thanks a lot for your samples, they introduced me to WebSPHINX! Cheers, Francesco.
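The pattern Francesco describes, where each change*() call returns a new object and leaves the receiver untouched, can be illustrated with a minimal self-contained analogue; the Params class below is hypothetical, standing in for WebSPHINX's DownloadParameters.

```java
// Minimal analogue of WebSPHINX's immutable DownloadParameters: each
// change*() method returns a NEW instance rather than mutating the
// receiver, just like java.lang.String. (Params is hypothetical.)
final class Params {
    final boolean obeyRobots;
    Params(boolean obeyRobots) { this.obeyRobots = obeyRobots; }
    Params changeObeyRobotExclusion(boolean v) { return new Params(v); }
}

public class ImmutableParamsDemo {
    public static void main(String[] args) {
        Params dp = new Params(false);
        dp.changeObeyRobotExclusion(true);      // result discarded: dp is unchanged
        System.out.println(dp.obeyRobots);      // false
        dp = dp.changeObeyRobotExclusion(true); // reassign: the correct pattern
        System.out.println(dp.obeyRobots);      // true
    }
}
```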
Sujit Pal (2012-03-28):
Hi propovip, if you want to do this with WebSPHINX, you can extend visit(Page) to extract the links in the page (page.getLinks()) and store them in, say, a database table as (parent, child). Then you could write something that reads this table and produces a report, starting from the root of your crawl and proceeding recursively down to each child, until the current child has no children of its own.

propovip (2012-03-28):
I have been asked to crawl the links my application has. Is there a way to generate a report with a hierarchy? For example, if I crawl a link "X" whose parent is "A", and "A"'s parent is "B", I should see the report as:

    Page B URL --> Page A URL --> Page X URL

How do I achieve this?

Sujit Pal (2011-05-19):
Thanks, Nguyễn Anh Đức. Not sure I follow your question completely, but generally the "Previous page" link would be enclosed in an A tag, something like <a href="www.link1.com">Previous</a>, so you could parse the first page looking for the A tags. There are various HTML parsers available, but my personal (and probably biased, since I use it exclusively for HTML) …
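The (parent, child) table and recursive walk that Sujit describes can be sketched without WebSPHINX at all; here the database table is replaced by an in-memory map, and the node names are stand-ins for page URLs.

```java
import java.util.*;

public class LinkReport {
    // (parent, child) pairs as they might be collected in visit(Page)
    // via page.getLinks(); stored in a map instead of a database table.
    static final Map<String, List<String>> children = new LinkedHashMap<>();

    static void add(String parent, String child) {
        children.computeIfAbsent(parent, k -> new ArrayList<>()).add(child);
    }

    // Walk from the crawl root down to each leaf, emitting one path per line.
    static void report(String node, String path, List<String> out) {
        String p = path.isEmpty() ? node : path + " --> " + node;
        List<String> kids = children.get(node);
        if (kids == null || kids.isEmpty()) { out.add(p); return; }
        for (String kid : kids) report(kid, p, out);
    }

    public static void main(String[] args) {
        add("B", "A");
        add("A", "X");
        List<String> out = new ArrayList<>();
        report("B", "", out);
        out.forEach(System.out::println); // B --> A --> X
    }
}
```

This produces exactly the "Page B URL --> Page A URL --> Page X URL" shape propovip asked for, one line per root-to-leaf path.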
somewherepeace (2011-05-11):
Dear Sujit Pal, thanks for your article, it is useful for me. I have a question: how do I get the title of the A tags on the previous page? For example, the previous page contains <a href="http://link1.com">title 1</a>; when I crawl the http://link1.com page, how do I get the string "title 1"? Sorry about my English.

Sujit Pal (2010-12-29):
I think you are missing the MyCrawler class from your classpath. See the first error:

    ...UrlHarvestingCrawler.java:13: cannot find symbol
    symbol: class MyCrawler

From there it is complaining about methods defined in MyCrawler that UrlHarvestingCrawler overrides. You can cut out the middleman by extending Crawler directly and overriding the methods like I did in MyCrawler.
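One way to answer somewherepeace's question above is to pull the anchor text ("title 1") out of the referring page's raw HTML (as returned by page.getContent()) with a regular expression. A proper HTML parser is more robust; this is only a sketch, and titleFor is a hypothetical helper name.

```java
import java.util.regex.*;

public class AnchorText {
    // Given raw HTML, return the text of the <a href="...">text</a>
    // element whose href matches the given URL, or null if absent.
    static String titleFor(String html, String url) {
        Pattern p = Pattern.compile(
            "<a\\s+href=\"" + Pattern.quote(url) + "\"[^>]*>(.*?)</a>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<p><a href=\"http://link1.com\">title 1</a></p>";
        System.out.println(titleFor(html, "http://link1.com")); // title 1
    }
}
```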
Alberto (2010-12-28):
Hi Sujit, how do you compile some of the Java files you have created with javac and the WebSPHINX classes? I'm new to Java, and when I try, errors occur like:

    C:\Users\Alberto\Desktop\websphinx>javac mycrawler\UrlHarvestingCrawler.java
    mycrawler\UrlHarvestingCrawler.java:13: cannot find symbol
    symbol: class MyCrawler
    public class …

Sujit Pal (2010-12-04):
Thanks teckk. Not sure if there is direct support for what you want to do, but you could probably do it like so: (1) instantiate a Set<String> in your crawler; (2) have shouldVisit(URL) return true if the Set is empty or contains the URL; (3) in visit(Page), call page.getContent(), parse out the links that you care about, and put them in the Set.
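The three steps above can be sketched as a shared set plus a predicate. WebSPHINX's actual shouldVisit takes a websphinx.Link rather than a String; the String version here is a simplification for illustration.

```java
import java.util.*;

public class LinkFilter {
    // URLs harvested from the HTML element we care about; empty until
    // the first page has been parsed in visit(Page).
    static final Set<String> wanted = new HashSet<>();

    // Mirrors a shouldVisit override: visit everything until the set is
    // populated, then only the URLs found inside the target element.
    static boolean shouldVisit(String url) {
        return wanted.isEmpty() || wanted.contains(url);
    }

    public static void main(String[] args) {
        System.out.println(shouldVisit("http://example.com/a")); // true (set empty)
        wanted.add("http://example.com/a");                      // found in visit(Page)
        System.out.println(shouldVisit("http://example.com/b")); // false
    }
}
```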
teckk (2010-12-03):
Hi Sujit, very nice posts. I was just wondering if you knew how to specify in shouldVisit() that it should only download links that appear inside a certain HTML element. For example, I only want to follow links that are in a certain list; do you have any idea how one would do this? Thanks.

Sujit Pal (2010-11-14):
Hi Mike, I have never used the WebSPHINX Workbench tool, so I'm not very sure, sorry. My usage has mainly been to quickly build ad-hoc crawlers for a single site. Using that approach, you could add multiple seed URLs with multiple Crawler.addRoot() calls.

Mike (2010-11-08):
Hi Sujit, I would like to know how to connect an event listener to monitor the crawler. Instead of setting the root to be crawled, I prefer to input my own set of URLs at runtime using the Workbench. So the questions are: how do I connect an event listener to monitor the crawler, and how do I get the input from the Workbench? Thanks for your help. -- Mike
Sujit Pal (2010-09-16):
Hi Usman, as far as I understand it, WebSPHINX will internally call visit(Page page) for each page it crawls, so once you set a root page and start the crawl, it will automatically follow the links on each page. If you make your visit(Page page) just capture the URL of the visited page, you should have what you want.
Usman (2010-09-15):
Hi Sujit, I want to use WebSPHINX in my project, but I'm still not sure how to use it. I copied MyCrawler.java and SiteDownloadingCrawler.java into a package. What I want to do is take a link from the user and return all the subsequent links in a file. I don't want to download huge pages, because they will eat up all the memory very quickly. It will …

Sujit Pal (2010-08-06):
Thanks Chirag. To answer your question, I don't think you can control the order unless you force WebSPHINX to run in a single thread. If the order is required because you want to present the URL and the HTML together, then it may make sense to write your HTML and URL at the same time, and use that file as your input. Alternatively, you could consider storing the …
Chirag (2010-07-31):
Hi Sujit, thanks for your wonderful posts. They are very informative and are leading me down the right path! I am currently using the WebSPHINX Java program to crawl a list of web pages and extract the data contained between specific tags. At the moment, the crawler outputs the data to an HTML file which presents an ordered list of the extracted values. The values …

Sujit Pal (2010-06-18):
Hi Le Van, I tried going to the /GL/Home URL you sent (with my browser), then moused over a few of the links. It looks like the links are pointing to /GL/*, i.e., /GL/Home is a sibling of all the links in it. If you are only interested in the /GL/Home links, then you should filter the URL in the code.
Unknown (2010-06-13):
I have a big problem! I set the root to http://vnexpress.net/GL/Home/ (with domain vnexpress.net), but I only get URLs with domain mvnexpress.net. Please try using the root http://vnexpress.net/GL/Home/ and give me an idea to solve this problem. Thanks!

Sujit Pal (2010-03-27):
Hi Troy, I am just another user of WebSPHINX. It's small and the API is quite easy to understand; my post details some of my understanding of it. Did you have specific questions about stuff you don't understand?

troy (2010-03-26):
Hi, this is Raja. I had trouble executing the WebSPHINX source code. I would like you to help me with the required explanations and steps. Kindly reply to my mail.

troy (2010-03-26):
I am Arun, just now studying a data mining paper for my project. I have to work with WebSPHINX, but I don't know how: I downloaded WebSPHINX but don't know where to start, so please provide materials and steps for working with WebSPHINX.
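The URL filtering suggested in the /GL/Home exchange above can be reduced to a prefix check in a shouldVisit override. This is only a sketch: WebSPHINX's shouldVisit actually receives a websphinx.Link, and the particular prefix is an assumption about which section of the site is wanted.

```java
public class PrefixFilter {
    // Mirrors a shouldVisit override: only follow links under the
    // section we started from, so links on other hosts (such as
    // mvnexpress.net) or in other sections are skipped.
    static boolean shouldVisit(String url, String prefix) {
        return url.startsWith(prefix);
    }

    public static void main(String[] args) {
        String prefix = "http://vnexpress.net/GL/";
        System.out.println(shouldVisit("http://vnexpress.net/GL/Home/x.html", prefix)); // true
        System.out.println(shouldVisit("http://mvnexpress.net/GL/Home/", prefix));      // false
    }
}
```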