
Provide documentation in Readme.md for using hadoop for storage instead of local mode #7

Open
jar349 opened this issue Dec 13, 2016 · 7 comments

Comments

jar349 commented Dec 13, 2016

I'm using your docker images to create a hadoop cluster (defined in a docker-compose file). Now, I would like to add your hbase image, but it is configured to use local storage.

I could create my own image based on yours with a custom configuration file, or I could mount the config volume and place my own config file there for hbase to read. However, I think there's a simpler path: having the image take local or hdfs as an argument and do the "right thing" on the user's behalf.

I am imagining something like command: hbase master local start or command: hbase master hdfs start, where the values needed to configure hbase-site.xml for Hadoop would come from environment variables (e.g. -e HDFS_MASTER=<hostname>).
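
To illustrate, a hypothetical compose service (neither the hdfs argument nor the HDFS_MASTER variable exists in the image today, and the image name is just a placeholder):

# purely a proposal sketch, not something the image supports yet
hbase-master:
  image: <your-hbase-image>          # placeholder image name
  command: hbase master hdfs start   # proposed "hdfs" mode argument
  environment:
    - HDFS_MASTER=namenode           # proposed: hostname of the HDFS namenode container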

What do you think?

davidonlaptop (Member) commented:

I agree with you that the documentation is not clear on how to use the hadoop and hbase docker images together. Using environment variables is an interesting approach that fits well with the Docker model.

You should consider that you may lose data locality with this method. As far as I know, Hadoop is not yet Docker-aware, so if the datanode and the regionserver run in separate containers they will have different IP addresses, and HBase will assume the two services are not on the same machine. Data access may therefore not be optimal.

However, many people use S3 in production, and Hadoop can't determine data locality with S3 either.

Can you elaborate more on your use case?

jar349 (Author) commented Dec 13, 2016

Use case:

Building a library of compose files that I can, ahem... compose together, a la: https://docs.docker.com/compose/extends/#/multiple-compose-files

I've already got a zookeeper quorum, and I've got a distributed hadoop cluster (using your hadoop image to provide a name node, data node, and secondary name node).

Now I want a set of files that I can compose on top of zookeeper/hadoop: hbase, spark, kylin, etc.

So, this would be for local development and testing. But my goal is to try to mimic a realistic setup, meaning: more than one zk instance, a hadoop secondary name node, more than one hbase region server, hbase actually using hadoop instead of the local file system, etc.
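
For example, the layering would look something like this (the file names are just placeholders):

# later files extend/override services defined in the earlier ones
docker-compose -f docker-compose.zookeeper.yml \
               -f docker-compose.hadoop.yml \
               -f docker-compose.hbase.yml \
               up -d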

dav-ell commented Feb 15, 2020

I'd also appreciate this. This is the best hbase docker repo I can find (that works with Thrift), and having this described clearly in the README would make this repository immensely powerful. Starting with no knowledge of HBase or HDFS, I'd be able to spin up a near-production-ready, HDFS-backed HBase DB in 10 minutes. You have to admit, that's pretty cool.

Don't forget all the students out there coming out of school, getting their feet wet with big data tools, and floundering because of their complexity. This would go a long way toward helping them.

davidonlaptop (Member) commented Feb 17, 2020 via email

dav-ell commented Feb 18, 2020

Thanks! I'll see what I can do.

Do you happen to know how to do it already? My progress on Hadoop in Docker has been slow. sequenceiq's is super old, big-data-europe's was giving me errors, and harisekhon's seems to work perfectly, so I was using that. However, trying to connect HBase to it hasn't been straightforward.

I had to change the configuration file (hdfs-site.xml) from the default (which was writing to /tmp) to:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <!-- single-node setup, so no block replication -->
        <name>dfs.replication</name>
        <value>1</value>
    </property>
    <property>
        <!-- store datanode blocks under /data instead of the default /tmp location -->
        <name>dfs.datanode.data.dir</name>
        <value>file:///data</value>
    </property>
</configuration>

in order for it to write to a new directory (that's easier for me to mount). Then I run it using something like:

docker run -d --name hdfs \
    -p 8042:8042 -p 8088:8088 -p 19888:19888 -p 50070:50070 -p 50075:50075 \
    -v $HOME/hdfs-data:/data \
    -v $HOME/hdfs-site.xml:/hadoop/etc/hadoop/hdfs-site.xml \
    harisekhon/hadoop

After that, I feel pretty confident that HDFS is set up properly. However, to connect HBase to it, the best I've got so far is changing the HDFS URL to:

hdfs://ip-of-docker-container:8020/
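
i.e. something along these lines in hbase-site.xml (hbase.rootdir and hbase.cluster.distributed are the stock HBase property names; the /hbase path suffix and the distributed flag are my assumptions):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <!-- point HBase at HDFS instead of the local filesystem -->
        <name>hbase.rootdir</name>
        <value>hdfs://ip-of-docker-container:8020/hbase</value>
    </property>
    <property>
        <!-- false would mean standalone mode on the local filesystem -->
        <name>hbase.cluster.distributed</name>
        <value>true</value>
    </property>
</configuration>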

Does that look right?

dav-ell commented Feb 18, 2020

Actually, that worked. Have any corrections before I add it to the readme?

dav-ell commented Feb 18, 2020

Pull request #10 added.
